Abstract


As modern digital circuits grow larger and more complex, the time required to perform
sequential simulations becomes unacceptably long. Since simulation is a vital input to the design
and validation of circuits, this bottleneck affects the efficiency of the entire development cycle.
Parallel simulation offers a solution that scales with the problem. By assigning circuit components
to distributed processors, the work of the simulation can be divided. There is, however, an
additional cost of synchronization between the cooperating processors that is not present in
sequential simulation. The manner in which circuit components are partitioned among processors
greatly influences the amount of overhead incurred. The task is to partition intelligently such that
computational parallelism is not overwhelmed by synchronization overhead.

In this research effort, heuristic techniques of intelligent partitioning were considered. By
observing trends of successful partitions, a statistical relationship between a priori, graph-based
parameters and parallel simulation runtime was developed. Formal definition of this relationship in
the form of a cost model allowed allocations to be ordered by predicted runtime. By choosing the
allocation with the lowest cost model value, the simulation using that allocation was expected to
have the lowest runtime of the set considered. “The set considered” is an important distinction
because the mapping of tasks to processors to achieve the lowest runtime is a known NP-Complete
problem. Finding an optimal solution is intractable; finding relatively good solutions is desirable.
The set of candidate allocations is chosen by using Kapp’s AB Improvement search procedure (Kapp,
1993:84). Unfortunately, both current and previous cost models failed to achieve significant
statistical correlation with runtime.

Improvement  was  achieved  through  controlling  feedback.  In  example  circuits,  previous 
partitioning  techniques  induced  feedback  among  processors.  By  eliminating  this,  better  than  linear 
speedup  was  achieved. 
ACCELERATING CONSERVATIVE PARALLEL SIMULATION
OF VHDL CIRCUITS


1.  Introduction 


1.1  Background 

Simulation  provides  fundamental  analysis  capabilities  for  many  theaters  of  research  and 
industry.  Similar  to  static  models,  simulations  provide  the  ability  to  examine  target  systems  at  an 
economy  of  time  and  resources.  Furthermore,  simulations  provide  an  ability  to  examine  systems 
impossible  to  construct,  for  example  weather  patterns  or  sub-atomic  particles.  Simulations  exceed 
static  models  by  adding  the  ability  to  examine  the  dynamics  of  a  system,  not  just  the  state  of  its 
components.  Because  of  these  benefits,  computer  simulation  is  a  basic  tool  of  defense,  weather, 
biomedical,  chemical,  financial,  electronic,  and  many  other  industries. 

A repeating theme for the computer is its inability to keep up with user requirements.
Each developmental leap of processing power is met by a larger need for computational resources.
For simulations, researchers seek finer resolution of simulation, or consumption of larger datasets,
or both. In the case of digital circuit design, chip transistor counts increase by 25% per year,
doubling every three years (Hennessy and Patterson, 1990:17). For all the marvels of processor
development, the sequential system creates an inherent bottleneck that will be realized with a
sufficiently large problem or sufficiently small timing constraint. For circuit design, slow simulation
lengthens the design cycle and increases the cost of the final product (Kapp, 1993:1). Assuming
that computer users will continue to require more than computer designers can fit into a single data
processor, a solution is to enlist more computers. Parallel processing is an alternative that may
prevent simulation from being the limiting phase of the research or development cycle. Immediately,
parallel simulation offers the ability to process larger datasets, perhaps in less time. More
importantly, if scalable¹, a large simulation can be accommodated by increasing the number of
processors regardless of the size of the dataset. Or, by utilizing more processors, a solution to a
defined problem can be found within some requisite timespan.

This paper results from the Advanced Research Projects Agency’s (ARPA) desire to
achieve performance improvement for the simulation of VHSIC Hardware Description Language
(VHDL) circuits. ARPA sponsors the QUEST project with the objective of a thousand-fold
speedup in large VHDL simulations (Kapp, 1993:1). Researchers at the Air Force Institute of
Technology (AFIT) have been investigating conservative parallel simulation of VHDL circuits for
several years. In 1992, Breeden (Breeden, 1992) demonstrated speedup for a random allocation of
VHDL behaviors² to processors (see Figure 1). In 1993, Kapp showed the benefit of more
intelligent partitioning strategies and the feasibility of iterative improvement of initial partitions
using his AB Border Improvement process (Kapp, 1993:84). This study continues AFIT research by
further exploring the potential of intelligent partitioning to increase speedup³ of parallel VHDL
simulation.


¹ Scalability is the ability to solve a problem for an increasing number of processors. Scalability is
considered with regard to size, time, and memory consumption.

² A behavior is an executable VHDL process representing a logic gate, source signal, or other simple
VHDL process (Kapp, 1993:2).

³ Speedup is the ratio of single-processor runtime to parallel runtime.

Figure 1 Speedup Curves for Wallace Tree (Kapp, 1993); vertical axis: Biased Speedup


If the simulation of a VHDL circuit is the end problem, partitioning that problem for parallel
execution is a preceding step of significant weight. Specifically, the allocation of N
precedence-constrained tasks to P processors to achieve minimum runtime is referred to as a version
of the Mapping Problem and is NP-Complete (Sartor, 1991:1-4). Fundamentally, an NP-Complete
problem has many potential solutions and no known polynomial-time method for finding an optimal
one in that solution space. Thus, it is unreasonable to seek an optimal solution for problems of
large size (greater than 100 nodes to be allocated). Actually, problems of relatively small size are
also unreasonable. Sartor gives the example of 60 nodes allocated to 2 processors. An exhaustive
search of the 60! possible combinations would take 2.63 x 10^66 centuries if a processor could
consider 1,000,000 combinations a second. The circuits worthy of consideration for parallel
simulation have more than 1000 nodes. Inherently, one is limited to a sub-optimal approximation of
the best solution locatable in some reasonable amount of time. “Reasonable” need not be polynomial,
but it does need to be amortizable over the number of simulation runs of the applicable circuit
(Nandy and Loucks, 1993:46).
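The order of magnitude quoted from Sartor can be checked with a few lines of arithmetic. The sketch below simply divides 60! by the stated rate of 1,000,000 combinations per second; the length of a century (100 years of 365.25 days) is an assumption made here, and the function name is illustrative only.

```python
import math

COMBINATIONS = math.factorial(60)   # search-space size quoted in the text
RATE_PER_SECOND = 1_000_000         # combinations examined per second
# Assumption: a century is 100 years of 365.25 days each.
SECONDS_PER_CENTURY = 100 * 365.25 * 24 * 60 * 60


def centuries_to_search(combinations: int, rate_per_second: int) -> float:
    """Time needed to enumerate every combination, expressed in centuries."""
    seconds = combinations / rate_per_second
    return seconds / SECONDS_PER_CENTURY


if __name__ == "__main__":
    result = centuries_to_search(COMBINATIONS, RATE_PER_SECOND)
    print(f"{result:.2e} centuries")
```

The result lands between 2.6 x 10^66 and 2.7 x 10^66 centuries, in line with the figure cited above.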

Furthermore, the goal of decreasing simulation runtime⁴ cannot be measured without running
the simulation. Ideally, one could determine the best partition using a priori parameters
(measurable before the simulation). “However, it is often difficult to relate this goal (minimum
runtime) to a usable parameter in the partitioning process” (Nandy and Loucks, 1993:43). The
task becomes one of finding a priori indicators, or a combination of indicators, that correlate with
a good runtime. Further complicating the task at hand is the number of parameters significant to
simulation runtime. Candidate parameters include:

•  number  of  behaviors 

•  number  of  interconnections 

•  number  of  dependencies  crossing  processor  boundaries 

•  imbalance  of  load  (number  of  behaviors)  allocated  to  processors 

•  minimum  path  length  of  subgraphs  allocated  to  each  processor 

•  granularity5  of  host  machine 

•  granularity  of  application 

•  fanout  of  behaviors 

•  level  of  feedback 

•  frequency  of  signal  change 

•  efficiency  of  event  list 

•  utilization  of  resources 

•  task  switching  capabilities  of  host  operating  system 

•  test  bench  size  and  nature 

•  many,  many  more 


⁴ In a parallel execution, runtime is max[finish time of all processors] - min[start time of all
processors].

⁵ Hardware granularity refers to the computational capabilities of each processor versus the ability to
communicate among processors. When computational and communication capabilities are equal, the
machine is labeled fine-grain. Usually computational speed exceeds communication throughput, making
the machine tend towards coarse-grain. Software granularity is the ratio of computational demands to
communication demands. Similar to hardware, low computational workload between communications
indicates a fine-grain application. A greater computation-to-communication ratio indicates a
coarse-grain application.

In order to make an analysis tractable, assumptions and scope must be used to focus the study and
likely limit the general applicability of results (Bailey and Pagels, 1991:627).

Despite the daunting number of complications in parallel VHDL simulation, it is a much
simpler case than general simulation since logic circuits have a static dependency graph. A logic
gate is wired exactly the same at the start of a simulation as it is at the finish. Non-deterministic
simulations allow actors to interact dynamically and consequently change lines of communication.
Hopefully, conclusions from this research will find applicability in more general simulation cases.

1.2  Research  Objectives 

As alluded to in the previous section, the goal of this research is to speed conservative
simulation of VHDL circuits. Pursuit of this goal includes graph-based analysis of the subject
circuit and statistical correlation of sub-graph parameters to runtime in order to achieve iterative
improvement of a “first-cut” allocation. A partition is a component, or subgraph, of the subject
circuit allocated to a single processor. An allocation is the set of partitions that make up a subject
circuit and is the end result of the partitioning process, including the mapping relation of partition
to processor. Iterative improvement investigates moving candidate nodes from one partition to
another, assessing benefit based on a representative cost model. Small cost model results should
correlate to comparatively small observed runtime values. Specific objectives are:

•  Determine  graph  based  partitioning  strategies  significant  to  the  speed  of  VHDL  simulations. 

•  Use  statistical  analysis  to  identify  allocation  parameters  consequential  to  runtime. 

•  Form  a  cost  equation  that  assesses  the  cost  of  an  allocation  and  correlates  to  the  runtime  for 
that  allocation.  The  cost  equation  should  at  least  define  a  partial  ordering  of  allocations. 

•  Demonstrate improvement of a statistically derived cost model over Kapp’s theoretically
determined model.

Questions to address in accomplishing the above objectives include:

•  How general is the statistical cost equation? Is it circuit specific? Is it host architecture
specific?

•  How well does the statistical model do? How does it compare to other simple strategies
(random, breadth-first, depth-first) and the Kapp Cost Model?

•  How  reliable  is  the  statistical  cost  model?  How  frequently  does  it  demonstrate  benefit  over  the 
Kapp  Cost  Model?  How  accurately  does  it  order  allocations? 

1.3  Assumptions 

The toolset used in this research is based on the commercial Intermetrics VHDL compiler
(Intermetrics VHDL Compiler, 1990). Previous AFIT researchers, Comeau and Breeden, developed
VSIM, which translates Intermetrics compiler output into parallel source code that can then be run
on the host parallel computer (Comeau, 1991). VSIM requires SPECTRUM, which provides the
interprocessor communication services and implements a version of Chandy-Misra’s null protocol
(Chandy and Misra, 1988). VSIM has been executed on Intel iPSC/2, iPSC/860, and Paragon
architectures. This report exclusively collects data on the iPSC/2. The final tool of significance is
the Graph Partitioning Tool (GPT). Modified to version 3.0, this software creates the necessary
mapping files and delay information deduced from an allocation. GPT performs initial and
iteratively improved allocations. There is a fundamental assumption that results from this toolset
are applicable to other simulation environments.

A biased speedup metric is used to evaluate parallel circuit performance. Speedup is
therefore overstated by using a non-optimized sequential simulation runtime in the numerator
(Wieland, 1990:1). However, statistics gathered on single-processor execution of VSIM
demonstrate negligible execution of overhead routines, which supports the assumption that biased
speedup is representative of true speedup.
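The two measurements behind this metric can be written down directly. The sketch below combines the runtime definition used in this chapter (latest finish minus earliest start over all processors) with speedup as single-processor runtime divided by parallel runtime. The function names and the sample timings are illustrative assumptions, not VSIM output.

```python
from typing import Sequence, Tuple


def parallel_runtime(intervals: Sequence[Tuple[float, float]]) -> float:
    """Runtime = latest finish time minus earliest start time over all processors."""
    starts = [start for start, _ in intervals]
    finishes = [finish for _, finish in intervals]
    return max(finishes) - min(starts)


def biased_speedup(single_processor_runtime: float,
                   intervals: Sequence[Tuple[float, float]]) -> float:
    """Speedup with the parallel simulator on one processor as the baseline."""
    return single_processor_runtime / parallel_runtime(intervals)


if __name__ == "__main__":
    # Hypothetical (start, finish) times in seconds for three processors.
    timings = [(0.0, 40.0), (1.0, 50.0), (0.5, 45.0)]
    print(parallel_runtime(timings))
    print(biased_speedup(120.0, timings))
```

The metric is "biased" because the numerator comes from the parallel simulator run on one processor rather than from an optimized sequential simulator.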

Modifications  to  the  VSIM  tool  may  affect  runtimes  of  research  conducted  in  this  thesis. 
Thus,  comparison  to  previous  results  will  be  accomplished  via  speedup.  It  is  assumed  that  tool 
specific  differences  will  be  isolated  allowing  assessment  of  the  statistical  cost  model. 

Also presupposed is that observed data follows a linear relationship in the space defined by
currently measured parameters. If runtime is far from linear in these parameters, or pertinent
parameters have not been included, the model will likely prove unacceptable except for the specific
cases composing the sample set from which the model was derived.
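One way to write the assumed linear form is given below. The symbols are chosen here for illustration and are not taken from the original text: $a$ is an allocation, $x_i(a)$ are its measured a priori graph parameters, $b_i$ are fitted coefficients, and $\hat{T}(a)$ is the predicted runtime.

```latex
\hat{T}(a) = b_0 + \sum_{i=1}^{k} b_i \, x_i(a)
```

Under this form, ordering allocations only requires comparing $\hat{T}$ values, so the constant $b_0$ does not affect which allocation is ranked first.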

1.4  Scope 

As  stated,  the  Mapping  Problem  is  NP-Complete  making  an  optimal  allocation  unlikely 
and  unverifiable.  This  research  seeks  to  achieve  “good”  allocations  assessed  by  comparison  with 
random  and  best-to-date  speedup  measures. 

The focus is on the feasibility of effective a priori partitioning. Naturally, the efficiency
with which partitioning is accomplished is basic to the merit of that strategy. However, efficiency
is not considered here under the assumption that, once a strategy is demonstrated as effective,
additional research can be spent to make the allocation process efficient.

Only  Chandy-Misra  null  protocol  simulation  of  VHDL  circuits  is  considered.  Also,  only 
VHDL  simulation  is  studied.  This  reduces  the  scope  to  the  static  dependency  subset  of  the  general 
simulation  problem. 

In developing the cost equation, gates are assumed to be uniformly active. The test vectors
used are general and do not seek to isolate circuit subcomponents nor boundary conditions.
Additionally, despite the fact that runtime is a maximum measurement for some single processor of
the simulation, data collected and analyzed is based on cumulative statistics across processors. It is
assumed that individual processor performance is close to mean performance. This global view of
the circuit allocation allows simpler analysis. Partitions are, therefore, constructed in a balanced
manner to avoid deviation from this assumption. Consequently, potential advantages of non-uniform
allocations are not investigated and specific data pertinent to the logical process (LP) driving the
maximum runtime is not reported.

Single  circuit,  multiple  allocation  ordering  is  pursued.  It  is  rarely  of  interest  to  compare 
runtimes  of  two  different  circuits,  only  the  runtimes  of  different  allocations  of  the  same  circuit. 
Model development occurs for a single circuit. The general applicability of that model is
considered separately.

1.5  Overview 

Chapter 2 presents a current literature survey in simulation prediction and modeling,
including results from previous AFIT research. After establishing the basics of conservative parallel
simulation and terms used in this paper, alternative cost models and parameters significant to
runtime are presented. Chapter 3 discusses the methodology and practices followed in this study. The
statistical analysis technique applied throughout is presented in detail. Other issues, including
further detail of the tools used and specifics of the simulation environment, are also furnished.
Implementation specifics and data results can be found in Chapter 4. Chapter 5 presents the
conclusions offered in this paper and recommends avenues of further study.

1.6 Summary


The simulation of VHDL circuits provides required insight for computer designers and is a
necessary part of the circuit development cycle. Sequential simulation is inherently limited in that
a problem can grow large enough to exceed the resources of that sequential system or violate
constraints on execution time. Parallelization offers hope of a scalable solution such that no matter
what the problem size (or perhaps timing constraint), effective and efficient simulation can take
place through the application of additional processors. Mapping is the NP-Complete problem that
is coupled with determining an allocation. It seeks to partition a problem into subcomponents and
to allocate each subcomponent to a processor to achieve some optimal condition. In this case,
atomic behaviors of the overall circuit are grouped into partitions and allocated to processors to
achieve short runtime. Since the mapping problem is so difficult, an optimum solution cannot be
found within reasonable time for large circuits. This research uses statistical analysis of graph
parameters to find good runtime performance. Effective models for simulation performance are
still immature: “...analytic modeling of parallel simulation is an art, for which I hope to shed light”
(Nicol, 1991:30).

2. Literature Review


2.1  Conservative  Simulation 

In order to proceed with the specifics of this research, a brief presentation of the actors and
dynamics of conservative simulation is required. Fujimoto describes parallel discrete event
simulation in his excellent article (Fujimoto, 1989). There are three major actors in a sequential
discrete event simulation, as portrayed in Figure 2: the event list, the clock, and the state variables.
A discrete event simulation proceeds by jumping from significant event to significant event as
presented by the event list. Events are supplied for processing in monotonically increasing order of
timestamp, which represents the simulation time at which the event should occur. Processing an
event requires updating the clock, which tracks the progression of the simulation. Additionally,
processing an event interacts with the state variables of the specific system being simulated. For
VHDL simulation, state variables are specific segments of wire, or signals, which can be logical 0
or 1. Processing an event may cause future events to be scheduled, as in the propagation of a signal
change through affected gates. These new events are submitted to the event list, which will hold
them until all earlier events have been released.

Figure 2 Basic Cycle of Discrete Event Simulation
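The cycle described above (take the earliest event, advance the clock, update state, schedule new events) can be sketched in a few lines. This is a minimal illustration and not the VSIM implementation. The gate tuple layout, the function name `simulate`, and the choice to schedule an output event for every affected gate and discard it later if it changes nothing are all assumptions made for the sketch.

```python
import heapq
from itertools import count


def simulate(gates, state, stimuli, until):
    """Run a sequential discrete event simulation and return the list of signal changes.

    gates:   list of (output_signal, function, input_signals, delay)
    state:   dict mapping signal name -> 0 or 1 (updated in place)
    stimuli: iterable of (timestamp, signal, value) events from source signals
    """
    order = count()  # tie-breaker so the heap never has to compare signal names
    event_list = [(t, next(order), sig, val) for t, sig, val in stimuli]
    heapq.heapify(event_list)
    changes = []
    while event_list and event_list[0][0] <= until:
        # The event list releases events in timestamp order; the clock jumps to each.
        clock, _, signal, value = heapq.heappop(event_list)
        if state[signal] == value:
            continue  # no change in the signal, so the event terminates here
        state[signal] = value
        changes.append((clock, signal, value))
        for out, func, inputs, delay in gates:
            if signal in inputs:  # propagate the change through each affected gate
                new_out = func(*(state[name] for name in inputs))
                heapq.heappush(event_list, (clock + delay, next(order), out, new_out))
    return changes


if __name__ == "__main__":
    gates = [
        ("n1", lambda a, b: 1 - (a & b), ("a", "b"), 2),  # NAND gate, delay 2
        ("y", lambda x: 1 - x, ("n1",), 1),               # inverter, delay 1
    ]
    state = {"a": 0, "b": 1, "n1": 1, "y": 0}
    for change in simulate(gates, state, [(0, "a", 1), (10, "b", 0)], until=100):
        print(change)
```

In this example each input change ripples through the NAND gate after 2 time units and through the inverter 1 unit later.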

Parallel simulation extends this basic sequential structure to multiple processors. As
depicted in Figure 3, each processor has its own sequential simulator, with each holding a disjoint
view of the state variable space. In general, PDES (parallel discrete event simulation) strategies
avoid coherency issues by strictly forbidding direct access to shared state variables (Fujimoto,
1989:3). The collection of simulation structures and unique view of the state variable space
defines a Logical Process (LP). The union ∪ LP_i over all i defines the state of the entire simulation
(Fujimoto, 1989:4). Upon encountering an event that affects the state variable space of another
LP, the source LP sends a timestamped event to the appropriate destination LP. Communication
between LPs requires additional data structures. These structures may buffer or somehow
prioritize message traffic, but the fundamental service required is to report to the owning LP the
timestamp information of each event received (or sent) on the communication channel¹. In addition
to the communication data structures, some protocol must manage the presentation of remote
messages to the local sequential simulator as well as the formatting of events destined for other
LPs. It is the responsibility of the protocol to preserve causality.

Critical to any simulation is the concept of causality. A correct simulation does not allow
causally related events to proceed out of order. If A affects B, A must be processed before B. A
causality error occurs when events are executed out of order and amounts to the future affecting
the past (Fujimoto, 1989:2). Two schools of thought prevail in discrete event simulation.
Optimistic protocols use a detection and recovery method to identify when causality errors occur
and roll back or otherwise correct the execution mistake. Conservative simulation uses prevention
to avoid causality errors. Conservative protocols force each LP to block (suspend simulation) until
it can be guaranteed that no messages from the past will arrive at the blocked LP. Chandy-Misra
methods guarantee that messages sent over communication channels will be monotonically
increasing by order of timestamp. Once all input channels have received a message with a
timestamp equal to or later than an LP’s local clock, that LP is guaranteed to be able to proceed
with simulation without risking a causality error. Unfortunately, the act of blocking introduces the
possibility of deadlock. Chandy-Misra’s null message protocol uses messages that carry only a
timestamp. The sender guarantees not to send any messages earlier than the null message
timestamp. Null message recipients update their input channel and progress even though the sender
has no real message for that LP. Chandy-Misra requires that all cycles of the contracted problem
graph² carry a non-zero minimum delay and that nulls be sent:


¹ A channel represents a uni-directional line of communication between a source and destination. It is
not a structure, but represents the possibility of communication between the two actors.

² The contracted problem graph, or LP graph, is the graph that results from contracting all nodes
allocated to an LP into a single node.

•  on all other output channels upon sending a real message down some channel

•  on  all  output  channels  upon  blocking 

•  on  all  output  channels  upon  receiving  a  null  message 

Optimizations can be made to reduce null message traffic, but the above conditions are sufficient
to prevent deadlock. While deadlocks are prevented, cyclic dependencies are not. The propagation
of null messages through a cyclic dependency allows all members to increase their input channel
times and, consequently, their local clocks. The cycle in effect “winds up” to allow processing to
continue on each participating LP.
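The “wind up” effect can be illustrated with a deliberately simplified sketch: a ring of LPs in which each LP has exactly one input channel, so its local clock equals the timestamp of the last null message received. The function name, the ring topology, and the `horizon` stopping rule are assumptions made for illustration; the real protocol also carries real messages and handles LPs with several input channels.

```python
def wind_up_ring(min_delay, horizon):
    """Circulate a null message around a ring of LPs until it would pass `horizon`.

    min_delay[i] is the minimum delay LP i adds before anything it sends can take effect.
    Returns the (lp, local_clock) updates in the order they occur.
    """
    if min(min_delay) < 0 or sum(min_delay) == 0:
        # The protocol needs a non-zero minimum delay around every cycle,
        # otherwise the clocks never advance.
        raise ValueError("the cycle must carry a non-zero minimum delay")
    n = len(min_delay)
    clocks = [0] * n
    trace = []
    # LP 0 blocks at time 0 and promises to send nothing earlier than 0 + its delay.
    lp, timestamp = 1 % n, clocks[0] + min_delay[0]
    while timestamp <= horizon:
        clocks[lp] = timestamp                  # single input channel: clock = channel time
        trace.append((lp, timestamp))
        timestamp = clocks[lp] + min_delay[lp]  # null message forwarded on the output channel
        lp = (lp + 1) % n
    return trace


if __name__ == "__main__":
    for lp, clock in wind_up_ring([2, 3], horizon=12):
        print(f"LP {lp} advances to time {clock}")
```

Each lap around the ring advances the clocks by the total minimum delay of the cycle, so larger minimum delays mean fewer null messages for the same amount of simulated time.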

2.2  Circuit  Simulation 

Circuit simulation is appropriate for discrete event simulation in that only a small fraction
of logic gates, typically 0.1%, change output values on a single clock tick (Soule and Gupta,
1989:85). Synchronous or continuous simulation would leave a majority of processing resources
idle in any time window. Soule and Gupta, however, state that this small activity level also
precludes a null message protocol from being efficient. The implication is that real messages would
numerically trail null messages by a similar ratio. Soule and Gupta are correct if using classic
Chandy-Misra, which assumes simple LPs and a large number of processors. The situation
changes when logic gates are grouped onto LPs exceeding the simple processing envisioned in
classic Chandy-Misra. Inter-LP arcs are collapsed into channels and only the fraction of logic gates
on the LP border contributes to the generation of null messages.

Frequently, when developing cost models³ for parallel simulation systems, researchers use
queuing systems as the example application (e.g., Fujimoto, Nicol). Circuit simulation differs from
queuing simulation in that queues do not generate new events. Every input event from a source in
a queuing system realizes exactly one terminal event in some sink of that system. In circuit
simulation, by contrast, fanout, cycles, and internal generators increase the event population of a
circuit beyond the number of events generated from source nodes. Similarly, events can terminate at
arbitrary vertices. If a signal change on the input of a logic gate does not change the output of that
gate, no subsequent event will be scheduled. This makes it difficult to predict the population of
events in the simulation at any given time, which in turn makes it more difficult to predict the
processing activity of any circuit subcomponent.

2.3  Previous  AFIT Results 

2.3.1  Random  Partitioning. 

Breeden explored random partitioning, which allocated behaviors to partitions arbitrarily,
seeking only to maintain a balance in the quantity assigned to each LP. While very simple and
quick to execute, this approach completely ignores the cost of inter-LP dependencies, which
arise from arcs⁴ between behaviors on different LPs. The results in Figure 1 and other tests
show that random partitioning provides speedup in limited cases and does not scale well due to
overwhelming communication dependency overhead.
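A balanced random allocation, and the inter-LP arcs it leaves behind, can be sketched as follows. This is an illustration of the idea only and not Breeden's code. The function names, the round-robin dealing used to keep LP sizes balanced, and the representation of the circuit as a list of (source, destination) arcs are assumptions.

```python
import random


def random_balanced_allocation(behaviors, num_lps, seed=None):
    """Shuffle the behaviors, then deal them round-robin so LP sizes differ by at most one."""
    shuffled = list(behaviors)
    random.Random(seed).shuffle(shuffled)
    return {behavior: i % num_lps for i, behavior in enumerate(shuffled)}


def inter_lp_arcs(arcs, allocation):
    """Arcs whose two endpoint behaviors are allocated to different LPs."""
    return [(u, v) for u, v in arcs if allocation[u] != allocation[v]]


def contracted_graph(arcs, allocation):
    """LP graph: each LP becomes one node, each communicating LP pair one directed channel."""
    return {(allocation[u], allocation[v]) for u, v in inter_lp_arcs(arcs, allocation)}


if __name__ == "__main__":
    arcs = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
    allocation = random_balanced_allocation("abcd", num_lps=2, seed=1)
    crossing = inter_lp_arcs(arcs, allocation)
    print(allocation)
    print(len(crossing), "of", len(arcs), "arcs cross LP boundaries")
    print(contracted_graph(arcs, allocation))
```

Because the allocation ignores the arcs entirely, nothing limits how many of them end up crossing LP boundaries, which is the communication cost the later strategies try to reduce.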


³ A cost model is a mathematical assessment of some criterion of the simulation system. In this research,
that criterion is runtime and the parameters of the mathematical model are graph-based measurements of
the partitioned circuit.

⁴ A single signal between two behaviors is an arc.

2.3.2 Simple Data Partitioning.

Kapp first implemented Simple Data Partitioning (SDP) in his Graph Partitioning Tool
v2.0. The problem graph is traversed in breadth-first or depth-first manner, allocating behaviors to
partitions as they are marked as found. Upon exceeding some threshold value for LP size,
behaviors are allocated to the next LP. SDP algorithms are simple and quick to execute while
yielding the benefit of allocating dependent behaviors to the same LP. Simple Depth-First (SDF)
groups paths in a circuit onto an LP. This promotes larger minimum delays through LPs and around
cyclic LP dependencies, allowing null messages to “wind up” at larger increments. However, as
shown in Figure 4, SDF with LP load balancing may result in fragmented paths, which counteract
the benefit of minimum delay. Simple Breadth-First (SBF) also succeeds in grouping dependent
behaviors onto the same LP. Like SDF, SBF does not sufficiently counteract communication
overhead, particularly as the number of processors grows large.
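The breadth-first variant can be sketched as below. This is a minimal illustration of the idea and not the Graph Partitioning Tool itself. The function name, the use of ceil(N / P) as the LP size threshold, and the rule of starting from all source signals at once are assumptions. `successors` must list every behavior as a key, even those with no outgoing arcs.

```python
from collections import deque


def simple_breadth_first(successors, sources, num_lps):
    """Allocate behaviors to LPs in the order a breadth-first traversal finds them."""
    threshold = -(-len(successors) // num_lps)  # ceiling division: behaviors per LP
    allocation = {}
    queue = deque()

    def found(node):
        if node not in allocation:
            # Once an LP holds `threshold` behaviors, later finds go to the next LP.
            allocation[node] = len(allocation) // threshold
            queue.append(node)

    for source in sources:
        found(source)
    remaining = iter(successors)
    while True:
        while queue:
            for successor in successors[queue.popleft()]:
                found(successor)
        # Restart from any behavior the traversal could not reach from the sources.
        leftover = next((n for n in remaining if n not in allocation), None)
        if leftover is None:
            return allocation
        found(leftover)


if __name__ == "__main__":
    circuit = {
        "in1": ["g1"], "in2": ["g1", "g2"],
        "g1": ["g3"], "g2": ["g3"], "g3": ["out"], "out": [],
    }
    print(simple_breadth_first(circuit, ["in1", "in2"], num_lps=2))
```

A depth-first variant would replace the queue with a stack, which tends to keep whole paths on one LP rather than whole levels.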

2.3.3  Strongly  Connected  Components. 

Strongly connected components (SCCs) are a basic concept of graph theory. An SCC is a
maximal set of nodes such that every node in the SCC is reachable from every other node. There can
be many SCCs in a particular graph. SCCs can also be defined as a set of cycles: the SCC is the
union of all vertices in the included cycles. Kapp and others observed the tightly coupled
communication of cycle members in a problem graph. Even if the frequency of real message traffic
between members is low, null protocol communication will have to occur within the SCC to prevent
deadlock. Given this easy way to identify coupled behaviors, Kapp enhanced Simple Data Partitioning
to consider SCCs as part of his AB Improvement partitioning technique described in the next
section. The idea is to keep SCC members on the same LP to prevent their significant contribution to
inter-LP communication.
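SCCs can be found in linear time with standard graph algorithms. The sketch below uses Tarjan's algorithm, which is a swapped-in textbook technique; the source does not say which algorithm Kapp's tool uses. The function name and the adjacency-dictionary representation are assumptions, and the recursive form is only suitable for small graphs because of Python's recursion limit.

```python
def strongly_connected_components(successors):
    """Return the SCCs of a directed graph as a list of sets (Tarjan's algorithm)."""
    index = {}       # discovery order of each vertex
    lowlink = {}     # smallest discovery index reachable from the vertex
    stack, on_stack = [], set()
    components = []

    def visit(v):
        index[v] = lowlink[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in successors[v]:
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:  # v is the root of a component
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in successors:
        if v not in index:
            visit(v)
    return components


if __name__ == "__main__":
    # Hypothetical circuit with one feedback loop between b and c.
    circuit = {"a": ["b"], "b": ["c"], "c": ["b", "d"], "d": []}
    feedback_groups = [c for c in strongly_connected_components(circuit) if len(c) > 1]
    print(feedback_groups)
```

Components with more than one member are the groups of behaviors that a partitioner would try to keep on a single LP.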

2.3.4  AB  Annealing.

Despite the title, Kapp’s AB Annealing process is actually an iterative improvement technique.
Kapp performs a simple data partition with consideration of SCCs. Upon finding a node
that is part of an SCC, all members of the SCC are marked found and allocated to the current LP
under consideration. From this simple, efficient initial allocation, Kapp identifies a priority queue
of candidate vertices for movement to other LPs. A candidate vertex is one that has more arcs to a
single, external LP than to the LP to which it is currently assigned. A candidate vertex v has a
priority calculated as follows:

Priority(v) = Max_External_Arcs(v) - Local_Arcs(v)  (Kapp, 1993:61)

Each candidate vertex is considered for movement to all other LPs. If a move reduces the marginal
approximation of communication cost and does not violate load balance tolerances, then the vertex
is moved and the allocation data structures are updated. A single iteration considers all candidate
nodes. Subsequently, a new iteration begins by re-identifying candidate vertices and repeating the
move processing. Iterations continue until a user-specified number is reached or until an iteration
yields no improvement of the marginal approximation of communication cost. Kapp’s entire cost
model is

$$H = \beta \cdot H_n \cdot H_c \cdot (1 + H_d) + \alpha \cdot H_b$$
Equation 1 Kapp Cost Model (Kapp, 1993:57)

where

$H_n$ : estimate of cost for null message traffic
$H_c$ : estimate of cost for real message communication
$H_d$ : estimate of effect of communication imbalance
$H_b$ : estimate of effect of load imbalance
$\beta$ : coefficient for communications cost
$\alpha$ : coefficient for processing cost

To estimate $H_n$, Kapp defines Lookahead. Lookahead is based on the minimum delays in traversing
LPs. As stated, the magnitude of the minimum delay influences the overhead incurred in
“winding up” cyclic dependencies and should therefore influence the number of nulls transmitted.
Kapp defines $H_n = L \cdot O_{arcs}$, where $L$ is Kapp’s Lookahead and $O_{arcs}$ is the number of
communication channels in the LP graph. $H_c$ estimates the amount of communication traffic induced
by an allocation. Kapp forms a communication matrix $C$ where each entry is

$$C_{ij} = (\text{number of arcs from } LP_i \text{ to } LP_j) \cdot (\text{weight of communication from } LP_i \text{ to } LP_j).$$

Weight of communication between LPs allows consideration of the number of hops in the underlying
architecture, available bandwidth on hardware communication lines, etc. $H_c$ is then found by

$$H_c = \frac{\sum_i \sum_j C_{ij}}{\text{total number of arcs}}.$$

Kapp divides by the number of arcs to normalize this metric over different circuits. The
distribution of communication is a unique term that attempts to assess the delay due to
communication imbalance, much as one would expect a delay due to computational imbalance between
LPs. If one considers the maximum row sum for the communication matrix versus the average row sum
for that matrix, a measure of communication imbalance can be formed. Let $D_i = \sum_j C_{ij}$ and

$$H_d = \frac{D_{max} - D_{avg}}{D_{avg}}.$$

Finally, $H_b$ adds the effect of computational load imbalance. By summing the computational
weights of behaviors allocated in a partition, a measure for computational load can be assessed for
that LP. Actual computational work also depends on the frequency of event generation for each
behavior, but since that information is not available a priori, only the static weight is
considered. Similar to $H_d$, if

$$Work_i = \sum_{behavior \in LP_i} (\text{weight of behavior})$$

then

$$H_b = \frac{Work_{max} - Work_{avg}}{Work_{avg}}.$$

$\alpha$ and $\beta$ are weighting parameters to balance the influence of communication costs versus
computation cost for a specific host architecture. For the Intel iPSC/2, Kapp assumed
$\alpha = 100.0$ and $\beta = 1.0$. Note that in the AB Annealing process, the entire cost equation
is not evaluated to decide the movement of candidate vertices. Kapp instead uses
$Cost = O_{arcs} \cdot H_c \cdot (1 - H_d)$. Summary results for all previous AFIT
partitioning techniques on the Wallace Tree Multiplier are shown in Figure 1.
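
As a rough illustration of how the terms of Equation 1 combine, the sketch below evaluates the full
cost for one allocation. It is not Kapp's code: the function name and argument layout are
hypothetical, the Lookahead term is assumed to be supplied by the caller, and $H_n$ is taken to be the
product of Lookahead and the number of LP-graph channels as read from the text.

```python
def kapp_cost(comm, work, total_arcs, lookahead, lp_arcs, alpha=100.0, beta=1.0):
    """Evaluate H = beta*Hn*Hc*(1 + Hd) + alpha*Hb for one allocation.

    comm[i][j]  : weighted arc count from LP i to LP j (the matrix C)
    work[i]     : summed static behavior weights on LP i
    total_arcs  : number of arcs in the circuit graph
    lookahead   : Lookahead term (assumed precomputed by the caller)
    lp_arcs     : number of communication channels in the LP graph
    """
    row_sums = [sum(row) for row in comm]                    # D_i
    h_c = sum(row_sums) / total_arcs                         # real message cost
    d_avg = sum(row_sums) / len(row_sums)
    h_d = (max(row_sums) - d_avg) / d_avg if d_avg else 0.0  # communication imbalance
    work_avg = sum(work) / len(work)
    h_b = (max(work) - work_avg) / work_avg if work_avg else 0.0  # load imbalance
    h_n = lookahead * lp_arcs                                # null message cost (assumed form)
    return beta * h_n * h_c * (1.0 + h_d) + alpha * h_b
```

With the iPSC/2 defaults, a load imbalance of 25% alone contributes 25.0 to the cost, which shows
how strongly $\alpha = 100.0$ weights the $H_b$ term relative to the communication terms.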

AB Annealing shows substantial improvement for the Wallace Tree at 8 nodes. However,
this technique is suspect for several reasons:

1. Improvement requires finding a best simple partition. Ultimate speedup depends on the quality
of the initial partition. Also, since this is a greedy technique, local minima will limit the
benefit of the iterative search.

2. Improvement for the Associative Memory is not as dramatic as for the Wallace Tree, raising
the concern of general applicability of the model.

3. Iterative improvement is based on a partial model, thus losing the influence of some terms. If
successful, the significance of those missing terms becomes questionable.

4. The layers of the model hide a canceling effect between $H_c$ and $H_d$. $H_c$ has a
$\sum_i \sum_j C_{ij}$ term in its numerator and $H_d$ has one in its denominator.

5. The estimates for $\alpha$ and $\beta$ are arbitrary and require mathematical basis.

Despite these drawbacks, however, Kapp successfully demonstrated iterative improvement of runtime
via graph-based partitioning. In no instance did his model cause a worsening of the base
allocation, and it does provide an efficient way to potentially get better allocations. It becomes
the task of this follow-on research to continue the successes of previous AFIT research and address
its shortcomings.

2.4  Hierarchical  Model 

Chamberlain and Franklin studied the simulation of digital circuits on a hypercube architecture
(Chamberlain and Franklin, 1990). Beginning with a simple cost model, they considered
and expanded terms until measurable terms became available. Upon achieving the layer of measurable
detail, they used their model to predict speedup. Unfortunately, the protocol used to manage
the parallel simulation was based on a global synchronous clock and therefore cannot be directly
applied to the distinct types of overhead incurred with a null protocol. However, the path of model
development is applicable to any protocol.

Chamberlain’s protocol uses a master processor to synchronize processing. Slave processors
perform all computation of the simulation under the direction of the master. The master
transmits a time to all slave processors, which then simulate up to that global time (barrier),
exchanging data with other slaves as necessary. Upon reaching that time, each slave sends a message
indicating arrival at the global time and includes the time of its next event. After receiving arrival
messages from all slaves, the master determines the lowest next event time and broadcasts that time
to the slaves. Also fundamental to this cost model is the imposed assumption of hierarchical circuit
structure. Circuits are built from components which are in turn built from sub-components which
eventually are constructed from gates. In order for the synchronous protocol to be efficient, time
steps must be sufficiently large to allow the processing of many gate level events. Thus, instead of
allocating individual behaviors to LPs, Chamberlain and Franklin allocate by functional unit.
Event times are based on functional unit delays, not gate delays. This is an example of increasing
the granularity of the problem. Unfortunately, increasing the granularity decreases the potential
degree of parallelism. From this simulation system, Chamberlain and Franklin build their model.

Table 1 Hierarchical Model Terms (Chamberlain and Franklin, 1990:11)

Variable                  Type                      Definition
R_P                       Output                    Simulation runtime using P processors
C                         Input                     Number of system components to be simulated
L                         Input                     Number of levels in hierarchical system description
B                         Input                     Number of busy ticks in the simulation
E                         Input                     Number of events
F                         Input                     Number of functional evaluations
M [subscript illegible]   Input                     Number of state change messages when P = [illegible]
k_l                       Input                     Fraction of event/component evaluations at level l
α                         Input                     Work distribution across communications links
β                         Input                     Work distribution across processors
P                         Design/Hardware specific  Number of processors
t_E                       Design/Hardware specific  Single event processing time
t_Fl                      Design/Hardware specific  Single functional evaluation time at level l
t_CF                      Design/Hardware specific  CPU time for single message formulation
t_CT                      Design/Hardware specific  CPU time for single message transmission
t_CR                      Design/Hardware specific  CPU time for single message reception
t_LM                      Design/Hardware specific  Link time for single message transmission
t_LV                      Design/Hardware specific  Link time for single message protocol overhead
M_P                       Abstract                  Number of state change messages with P processors
H                         Abstract                  Average number of hops required per message
W                         Abstract                  Average communications width
t_CPU                     Abstract                  CPU time per busy tick
t_COMM                    Abstract                  Communications link time per busy tick
t_SYNC                    Abstract                  Synchronization time per busy tick
t_EVAL                    Abstract                  Event/functional evaluation time per busy tick
t_F                       Abstract                  Average single functional evaluation time
t_COMCPU                  Abstract                  Communications overhead for CPU per busy tick

If the simulation executes for $B$ busy ticks (a busy tick is a clock increment in which
simulation activity occurs),

$$R_P = B \cdot [\max(t_{CPU}, t_{COMM}) + t_{SYNC}]$$

This assumes that computation and communication occur concurrently in a busy tick. CPU time
dedicated to message formation and protocol service is included in a component term of $t_{CPU}$.
$t_{COMM}$ refers to the time spent in inter-slave communication for cooperative processing;
$t_{SYNC}$ is the communication with the master for global synchronization. The CPU time used over
all processors per busy tick can be broken down as $t_{CPU} = t_{EVAL} + t_{COMCPU}$, where
$t_{EVAL}$ is time spent performing actual event/functional evaluations and $t_{COMCPU}$ is CPU
time spent forming and executing message protocol. The CPU time spent evaluating can be expressed
in terms of the time to perform an event evaluation, the average number of events per busy tick,
the time to perform a functional evaluation, and the average number of functional evaluations per
busy tick. On a per processor basis, this becomes

$$t_{EVAL} = \beta \cdot [(E / BP)\, t_E + (F / BP)\, t_F]$$

$\beta$ is an imbalance coefficient since the computational work may not be completely uniform. The
average time to perform a functional evaluation is the sum of all functional evaluation times
divided by the number of evaluations. On average there are $M_P / B$ state change messages over the
communication network between interacting slaves on each busy tick, so each processor receives
$M_P / BP$ messages per busy tick. Messages travel through an average number of intermediate nodes
which must incur processing time to forward them. The resultant expression for CPU time spent in
communications overhead is

$$t_{COMCPU} = \alpha \cdot \{ H \cdot (M_P / B)[(t_{CT} + t_{CR}) / P] + (M_P / B)(t_{CF} / P) \}$$

$\alpha$ is a term representing communications imbalance and serves a similar purpose as $\beta$.
$t_{COMM}$ considers the expected volume of inter-slave messages versus the available bandwidth of
the architecture as

$$t_{COMM} = \alpha \cdot H \cdot (M_P / B) \cdot [(t_{LM} + t_{LV}) / W]$$

The last term, $t_{SYNC}$, is assumed to be accomplished via a complete exchange. In a hypercube
architecture, each processor sends a message down $\log_2 P$ adjacent channels.

$$t_{SYNC} = t_{LV} \cdot \log_2 P + t_{LM} \cdot (P - 1)$$
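
To make the layering of the model concrete, the following sketch evaluates the runtime expression
from its component terms. It is an illustrative reading of the equations in this section, not code
from Chamberlain and Franklin; the function name and argument names are assumptions that mirror the
variable names of Table 1, and the synchronization term follows the form reconstructed from the
text.

```python
import math

def hierarchical_runtime(B, E, F, M_P, P, H, W, alpha, beta,
                         t_E, t_F, t_CF, t_CT, t_CR, t_LM, t_LV):
    """Predict R_P = B * (max(t_CPU, t_COMM) + t_SYNC) for P processors."""
    # Evaluation time per busy tick, scaled by the load imbalance coefficient.
    t_eval = beta * ((E / (B * P)) * t_E + (F / (B * P)) * t_F)
    # CPU time spent forwarding, sending, receiving, and forming messages.
    t_comcpu = alpha * (H * (M_P / B) * ((t_CT + t_CR) / P)
                        + (M_P / B) * (t_CF / P))
    t_cpu = t_eval + t_comcpu
    # Link time for inter-slave messages, limited by communications width W.
    t_comm = alpha * H * (M_P / B) * ((t_LM + t_LV) / W)
    # Complete exchange with the master on a hypercube (assumed form).
    t_sync = t_LV * math.log2(P) + t_LM * (P - 1)
    return B * (max(t_cpu, t_comm) + t_sync)
```

Setting P = 1 and M_P = 0 removes every communication and synchronization term, so the prediction
collapses to the sequential evaluation time, which is a convenient sanity check on any
implementation of the model.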

A version of this model was used to simulate several circuits allocated by a simulated annealing
partitioning strategy. The results were disappointing in that random allocation consistently
exceeded the performance of the simulated annealing allocation. Chamberlain and Franklin cite the
following reasons for this unexpected result:

•  The cost model does not consider load balancing, while that is all the random partition
considers; load balancing is a pertinent parameter.

•  The frequency of evaluation was assumed uniform over all components. Some components
were executed 12 times more than the average, causing a significant computational and
communications imbalance.

•  Achievable parallelism is limited by the global clock protocol to the number of processors able
to simulate at any single point in simulation time.

Additional problems with this model include the high degree of knowledge required of a simulation
and its environment. Some terms can only be determined by running the simulation once and
feeding that information back into the cost model. Many terms are average quantities that would
suffer the same failure as assuming uniform activation frequency for behaviors. Significant
deviations from the average are probable and will throw the model off. While the terms themselves
are questionable, the way in which they were identified and determined is admirable. Any model
development must reflect the theoretical basis of the system in meaningful terms; otherwise the model
becomes incomprehensible and unextendable.


2.5  Parameters  of  Partition  Models 

An early question in model development is, “which parameters are important?” Failure to
identify a robust set of candidates will ignore pertinent degrees of freedom (dimensions) of the
target space and limit the correctness of the model to a subset of that space. What parameters are
important in predicting the runtime of conservative VHDL simulations? Certainly, the number of
gates is influential. Also, the size and nature of the test vector is material. This section presents
additional parameters of merit.

2.5.1  Lookahead 

The minimum delay through an LP of the simulation allows the prediction of the earliest
subsequent event to come from that LP. Special considerations must be given to event generating
applications like VHDL simulation, which will be discussed later. “Quantitatively, if a process has
knowledge of all events that will occur up to simulated time T, and can predict all new events it
will generate with timestamp T + L or less, then a process is said to have lookahead L” (Fujimoto,
1988-1:34). A lower bound on lookahead is the minimum delay through that process. By the
definition, additional knowledge (perhaps event dependent) could allow larger lookahead instances,
but this would require more intelligent processing. Minimum delay is static for static problem
graphs and is generally synonymous with lookahead used in practice. Reducing lookahead limits
the parallelism inherent in the system. For example, in Figure 5, Queue A sends an arrival event to
B for each local departure event. If the processing time of each queue is known (say 5), greater
parallelism can be exploited by immediately sending B its arrival for T + 5. This is an overly
simplified example since Queue A may have jobs in its queue and cannot process the new arrival for
some time. However, this, too, is a known quantity. Instead of sending arrival_time +
processing_time, A can send an arrival to B at LastScheduledFinishTime + processing_time.
Conservative simulations are acutely sensitive to cyclic dependencies. Expanding the previous single
queue example to the LP level, lookahead is critical to the speed at which cycles are wound up and
simulation allowed to proceed.

Figure 6 Multiple Paths and Lookahead

As previously mentioned, circuit simulation differs from queuing simulation in that events
can procreate or terminate between source and sink nodes. Multiple paths through an LP may
cause events to arrive at a destination LP at a time increment less than the minimum delay through
the LP. As shown in Figure 6, the upper path is 12 hops and the lower path is minimally 10 hops.
Consequently, any event arriving at LP0 will not be seen by LP1 until at least t + 10. It would be
beneficial to allow LP1 to simulate up to this time. Unfortunately, a single event to LP0 may
generate many arrivals at LP1. An event at t may generate arrivals to LP1 at times t+10, t+12, t+16,
t+22, ... . This is caused by multiple paths and an included cycle which may periodically generate
events indefinitely. Thus, an arrival at LP0 at t does not free LP1 to simulate up to t + 10. A
previous arrival at LP0 may have caused events destined for LP1 at times less than t + 10.

This does not completely deny the benefit of lookahead to digital simulations; it just requires
more intelligence in determining lookahead. LPs are informed of lookahead possibilities via
null messages. Upon any arrival, the source LP adds its minimum delay to the arrival timestamp
and sends an appropriate null to the destination LP. As discussed, minimum delay is not guaranteed
to represent the minimum time increment for a departure from the source; the source must test
for earlier departures. A simple way to do this is to check the next event list of the local sequential
simulator. The time of the earliest event on the list is a lower bound on potential departures. Thus,
a safe time to send adjacent LPs is $\min[t_{next\ event},\ t_{clock} + delay]$. The first term of
the min function is extremely limiting since many events will be in the next event list with
timestamp less than $t_{safe}$⁵ for the destination LP. However, it is hoped that the latter term
will be used with enough frequency to yield the benefits of lookahead. Figure 7 demonstrates that
as the number of LPs grows (and computation becomes finer grained), the number of null messages
posed for transmission with the minimum delay timestamp grows relative to those posed with the
minimum event list timestamp.

⁵ All incoming events to an LP are guaranteed to be later than or equal to $t_{safe}$. Thus, the LP
may safely progress to that point.

Figure 7 Nulls Posed with Delay and Event List Timestamps (Wallace Tree)

2.5.2  Lookahead  Ratio. 

Fujimoto added to the definition of lookahead by observing that the absolute value for lookahead
is not as important as the lookahead relative to the time between the arrival and departure
of an event. Similarly, Wagner and Lazowska observed that “since LPs affect each others’ clocks
by exchanging messages, this implies that lookahead values need to be comparable to the average
message timestamp increment in order to be useful” (Wagner and Lazowska, 1989:150). Intuitively,
a lookahead of x is somewhat meaningless unless x is large relative to the expected delay
through the circuit. Fujimoto, therefore, defines Lookahead Ratio (LAR) as

$$LAR = \frac{\text{mean timestamp increase}}{\text{lookahead}}$$ (Fujimoto, 1988-2:18).

Note that this formula assumes that lookahead is constant. In reality, lookahead can be unique
for all (source LP, destination LP) pairs. Furthermore, Fujimoto offers no advice on an aggregate
LAR metric that incorporates the LAR values for all LPs of an allocation.
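
A minimal sketch of the ratio, evaluated separately for each (source LP, destination LP) pair, is
given below. The function names and the dictionary layout are hypothetical; the sketch
deliberately stops short of an aggregate metric, since the source notes that none is defined.

```python
def lookahead_ratio(timestamp_increases, lookahead):
    """LAR = mean timestamp increase / lookahead for one channel."""
    if lookahead <= 0:
        raise ValueError("lookahead must be positive")
    mean_increase = sum(timestamp_increases) / len(timestamp_increases)
    return mean_increase / lookahead

def per_channel_lar(increases_by_channel, lookahead_by_channel):
    """Compute one LAR value per (source LP, destination LP) pair."""
    return {
        channel: lookahead_ratio(increases, lookahead_by_channel[channel])
        for channel, increases in increases_by_channel.items()
    }
```

A channel whose messages advance by 20 time units on average but whose lookahead is only 5 has an
LAR of 4.0; by the definition above, larger values indicate lookahead that is small relative to the
message timestamp increment.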

2.5.3  Contention 

Increasing the number of processors upon which a circuit is simulated makes the application
finer grained (less average computation between communications). Correspondingly,
communications overhead will likely increase as will the influence of contention. Contention occurs
when  message  delivery  between  processors  is  delayed  due  to  unavailable  physical  channels  of  the 
underlying  hardware.  Chittor  and  Enbody  develop  a  mathematical  model  to  predict  actual  message 
throughput  (kp)  based  on  an  applied  message  rate  (ka)  and  a  saturation  message  rate  (ksat). 

%a-\)-\(K  +  \at)  +  k-K,  =  0  (Chittor andEnbody,  1991:11-3) 

This  model,  like  others,  requires  accurate  prediction  of  the  message  generation  rate  of  LPs  which  is 
very  difficult  to  come  by  for  logical  circuits.  Chamberlain  has  proposed  pre-simulation  to  form 
reasonable  estimates  of  circuit  activity. 

2.5.4  Fanout. 

Nicol observed that the higher the fanout, the more descendant nodes are held back by
conservative blocking. This is particularly true of fine grained applications. Grouping nodes together
onto  an  LP  as  well  as  partitioning  to  increase  lookahead  will  serve  to  diminish  the  amount  of 
blocking  occurring  in  the  simulation  (Nicol,  1991:34).  Then,  Nicol  maintains,  for  coarse  grained 
machines,  performance  degradation  is  not  due  to  blocked  processors,  but  due  to  communication 
and  synchronization  overheads  (Nicol,  1991:42). 

2.6  Techniques  and  Special  Considerations


Bottlenecks in the problem graph present inherent limitations to parallelism that may not be
observable in the actual system being simulated. Wagner and Lazowska give the example of a
queuing network model of computer terminals accessing a central CPU with multiple disks
attached (see Figure 8). In an actual system as depicted, the service time of the CPU would be much
less than the service time of the disks such that the CPU would not delay disk service requests. In
a simulation, however, a CPU service and disk service are of the same relative magnitude. Ideally,
all nodes of Figure 8 would be simulated on unique processors. However, since the disks receive
all requests from the CPU and the maximum utilization of the CPU is 1.0, the sum of disk
utilization cannot exceed 1.0. Nor can the terminals’ effective utilization exceed the consumption
rate of their only destination. Thus, the maximum parallelism available in this model is 3.0
(Wagner and Lazowska, 1989:147).

A fundamental structure of discrete event simulation is the next event list (also imprecisely
called the next event queue). This structure will hold all the events generated and processed on an
LP. Its efficiency is material to the speed of processing. Nicol recommends a splay tree
implementation for large event populations (Nicol, 1993:325).
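
For illustration only, the sketch below shows the interface a next event list typically exposes. It
uses a binary heap from the Python standard library rather than the splay tree Nicol recommends, so
it is a stand-in for the data structure, not an implementation of his recommendation; the class and
method names are hypothetical.

```python
import heapq
import itertools

class NextEventList:
    """Time-ordered pending events for one LP (binary heap stand-in)."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so events with equal timestamps pop in insertion order.
        self._counter = itertools.count()

    def schedule(self, timestamp, event):
        heapq.heappush(self._heap, (timestamp, next(self._counter), event))

    def peek_time(self):
        """Timestamp of the earliest pending event, or None if empty."""
        return self._heap[0][0] if self._heap else None

    def pop_next(self):
        timestamp, _, event = heapq.heappop(self._heap)
        return timestamp, event

    def __len__(self):
        return len(self._heap)
```

Both insertion and removal of the earliest event cost O(log n) here; the choice between a heap and
a splay tree matters mainly for the large event populations Nicol has in mind.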

2.7  Statistical  Analysis 

“The goal of scientific analysis is the collection and organization of information concerning
the world around us with the goal of increasing our understanding of the things we observe.”
(Thorndike, 1978:3). Thorndike continues to outline the stages of scientific discovery:
observation, organization and prediction, explanation and understanding. This study fits in the second
stage of “exploration”. While there are many parameters with documented relationships to
simulation runtime, many others remain with inter-relationships that are currently unexplored and
undefined. Using basic techniques of statistical analysis, further insight to parallel simulation
modeling is sought.

Numbers are a convenient way to label parameter values. Despite the absolutism of digits,
numbers do not always carry the same meaning. When measuring parameters of interest, four
types of scaling are pertinent. Nominal scaling attempts no quantitative assessment; it merely
identifies membership to a particular category (e.g. male, female). Ordinal scaling makes general
quantitative implication within a category (e.g. IQ levels). Interval scaling provides a precise
quantitative relationship between differing values within a category (e.g. degrees Celsius). Ratio
scaling is the final type and is interval scaling where 0 represents the absence of that trait in the
sample (e.g. degrees Kelvin, or age).

The principle of least squares forms the basis of correlational procedures. Finding a least
squares solution to a problem asserts that any other point, line, or curve will result in a larger sum
if the difference between the estimator and each measured result is squared and summed together.
It provides the most accurate predictor “on average for the group” (Thorndike, 1978:21). It does
this regardless of the shape of the observed data. However, if the data is decidedly non-normal,
other statistics may be better suited for the analysis. Each parameter will vary over observed data
points. If not, no significant conclusion can be made about the parameter and the criterion to be
predicted. If a parameter increases as the criterion increases, there is a positive correlation. If a
parameter change of amount x consistently corresponds to a criterion change of amount y, then a
high degree of correlation can be expected between the parameter and the criterion. Variance of
the criterion can therefore be explained by variance in the parameter. The ratio of explained
variance ($SS_R$) to unexplained variance ($SS_E$) indicates the significance of the model, or the
ability to conclude that the predictor is not coincidentally related to the criterion. The total
variance ($SS_y$) is the sum of explained and unexplained variance.

Particularly important to the accuracy of a statistical model is the assumption of linearity.
If the data samples fail to follow a linear form, then the use of a linear model is inappropriate. The
difficulty becomes “seeing” if the data is linear in n-dimensional space. Linearity of two
dimensional cross-sections (partials) of the overall sample space does not imply linearity of the
problem. Based on a sound theoretical model, mathematical techniques can be used to form a linear
model. Alternatively, polynomial and logarithmic regression techniques are available. A linear
regression equation has the following form:

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

Equation 2 General Linear Equation

or, in matrix form,

$$y = \begin{bmatrix} 1 & x_1 & \cdots & x_k \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}$$

or

$$y = \beta \cdot x$$

Equation 3 Matrix form of General Linear Equation

Equation 3 represents the prediction of a single criterion value $y$, given a parameter vector $x$
and a coefficient vector $\beta$. Each entry of the parameter vector, $x_j$, represents a single
sample measure for some unique parameter. For sample $i$ of the sample set,

$$y_i = \beta \cdot x_i$$

Let $X$ be a matrix such that row $i$ of $X = x_i \ \forall i$. This two-dimensional matrix has each
row as the vector of parameter values for a single data sample. Each column is the vector of values
for a single parameter over all samples. Using the vector $y_i \ \forall i$ as the predicted results,
sum of squares comparison to observed data is possible, as is correlational and significance testing.

Alternatively, observed $y$ values can be substituted in the above equation to find the $\beta$ that
results in the minimum sum of squares. Since the rest of the discussion uses only the vector and
matrix forms of variables, let $\beta$ denote the coefficient vector, $x = X$, and $y$ = vector of
observed runtime values. As such, the following equation finds $\beta$:

$$\beta = (x^T x)^{-1} x^T y$$

Equation 4 Coefficients of Linear Model

Sum of square error not explained by parameters can then be found by:

$$SS_E = y^T y - \beta^T x^T y$$

Equation 5 Residual Error of Sum of Squares

Sum of squares explained by regression is:

$$SS_R = \beta^T x^T y - \frac{\left( \sum_{i=1}^{n} y_i \right)^2}{n}$$

Equation 6 Regressive Error of Sum of Squares

To test the significance of the model, the appropriate F statistic is calculated by:

$$F_0 = \frac{SS_R / k}{SS_E / (n - k - 1)} = \frac{MS_R}{MS_E}$$

reject insignificance if $F_0 > F_{\alpha,\,k,\,n-k-1}$

Equation 7 Linear Regression Test of Significance

where $k$ is the number of parameters (columns) in the $x$ matrix, and $n$ is the number of samples
taken (rows of the $x$ matrix).
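
The sequence of Equations 4 through 7 can be followed in a short, dependency-free sketch. This is
an illustration under stated assumptions rather than the analysis code used in this study: the
function names are hypothetical, a constant column is prepended to each sample so that $\beta_0$ is
estimated, and the normal equations are solved by plain Gaussian elimination.

```python
def solve(a, b):
    """Solve a*z = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

def fit_linear_model(samples, y):
    """Return (beta, SS_E, SS_R, F0) for rows of parameter values and observed y."""
    x = [[1.0] + list(row) for row in samples]   # constant term for beta_0
    n, cols = len(x), len(x[0])
    k = cols - 1                                 # number of parameters
    xtx = [[sum(x[i][a] * x[i][b] for i in range(n)) for b in range(cols)]
           for a in range(cols)]
    xty = [sum(x[i][a] * y[i] for i in range(n)) for a in range(cols)]
    beta = solve(xtx, xty)                       # Equation 4 via normal equations
    bxy = sum(b * t for b, t in zip(beta, xty))  # beta^T x^T y
    ss_e = sum(v * v for v in y) - bxy           # Equation 5
    ss_r = bxy - sum(y) ** 2 / n                 # Equation 6
    f0 = (ss_r / k) / (ss_e / (n - k - 1))       # Equation 7 (needs SS_E > 0)
    return beta, ss_e, ss_r, f0
```

The returned $F_0$ would then be compared against a tabulated $F_{\alpha,k,n-k-1}$ value; that
lookup is left out because the standard library provides no F distribution.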

If a model has already been established, statistical testing is straightforward using the
above equations. If no model exists, one must be formed from the candidate set of parameters.
Many techniques are available to construct a linear model. Thorndike proposes starting with the
candidate with the highest correlation. Add to this the parameter which results in the greatest
increase of the $F_0$ statistic. Test each parameter of the combined model for significance by a new
statistic

$$F_P = \frac{SS_R - SS'_R}{MS_E}$$

where $SS'_R$ is the regressive sum of squares of the linear model excluding the parameter of
interest. Essentially, this test sees if the marginal contribution of the parameter of interest is
significant. If the marginal test indicates insignificance, remove that parameter
from the model. Continue building by the $F_0$ test.

It is difficult to assess the general applicability of a model built from sample data. That
sample data may be biased or insignificant to the characteristics of the overall population from
which the samples are drawn. Thorndike recommends a double cross-validation technique to limit
this risk. The sample set is divided in half and independent models developed for both halves. If
each independently developed model appears similar to the other in form while effectively
predicting the other’s sample data, then the parameter set of the model is verified and can be
redeveloped on the combined sample.
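A minimal sketch of double cross-validation follows. It assumes a one-parameter straight-line model and a mean absolute error measure purely for illustration; the function names and the synthetic samples are hypothetical and are not taken from Thorndike or from this research.

```python
import random


def fit_line(points):
    """Least-squares line through (x, y) points; returns (intercept, slope)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return my - slope * mx, slope


def mean_abs_error(model, points):
    """Average absolute prediction error of a model on a set of points."""
    b0, b1 = model
    return sum(abs(y - (b0 + b1 * x)) for x, y in points) / len(points)


def double_cross_validate(samples, seed=0):
    """Split samples in half, fit each half, and test each model on the other half."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    set_a, set_b = shuffled[:half], shuffled[half:]
    model_a, model_b = fit_line(set_a), fit_line(set_b)
    return {
        "model_a": model_a,
        "model_b": model_b,
        "a_on_b": mean_abs_error(model_a, set_b),  # model A predicting Set B
        "b_on_a": mean_abs_error(model_b, set_a),  # model B predicting Set A
    }


# Synthetic, noise-free samples: both halves should recover the same line.
result = double_cross_validate([(x, 3.0 + 2.0 * x) for x in range(20)])
```

If the two models have similar coefficients and each predicts the other half well, the parameter set would be accepted and refit on the combined sample.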

2.8  Summary 

Conservative parallel discrete event simulation carries with it assumptions about the basic
data structures used. Conservative simulation guarantees that an LP will never receive a remote
event with a timestamp less than its current clock. It does this by blocking an LP until all input
channels have a most recent communication timestamp greater than or equal to the next event time
of the LP. Blocking, however, creates the risk of deadlock, which can be eliminated with the use of
null messages. Previous AFIT researchers succeeded in achieving limited speedup of circuits using
random, simple data, and iterative improvement partitioning strategies. Key to hill-climbing or
annealing searches is a cost function to evaluate the worth of the current search position.
Chamberlain developed a model for a synchronous protocol simulation, and others like Nicol,
Fujimoto, and Wagner have identified parameters of interest to any cost model. No published cost
model has been demonstrated as statistically significant to partial or total ordering by runtime of
VHDL circuit simulation. One particular difficulty with VHDL is the inability to predict load
balance since nodes may create or destroy events differently depending on the input vector. Using
statistical analysis may offer some insight into model building and add the legitimacy of
mathematical significance to proposed models.




3.  Methodology  and  Tools 


3.1  Overview 

The  approach  to  this  research  is  outlined  by  the  steps  of  statistical  analysis  discussed  in 
Section  2.7. 

1.  Using  the  Graph  Partitioning  Tool  (GPT)  from  Kapp’s  research,  generate  a  sample  set  of 
Wallace  Tree  simulations. 

2.  Use  this  sample  set  to  form  a  cost  model  using  double  cross-validation. 

3. Simulate the Wallace Tree to create a validation set to test the reliability of the statistical cost
model.

4.  Simulate  the  Associative  Memory  circuit  to  test  the  general  applicability  of  the  statistical  cost 
model. 

Support of this plan required major revisions to the Graph Partitioning Tool (GPT). Also,
additional instrumentation of SPECTRUM provided insight into the work breakdown of processors in
VHDL simulations.


3.2  Spectrum 


SPECTRUM was originally developed at the University of Virginia on a BBN Butterfly.
AFIT researchers converted this code for execution on the Intel iPSC/2 hypercube. SPECTRUM
uses an SPMD (Single Program Multiple Data) model, which means that each processor has its own
copy of the simulation. The state space is divided among processors using the configuration files
lpx.arcs and lpx.map, where x is the number of LPs. Figure 9 shows the structure of SPECTRUM,
which uses various levels of abstraction to simplify and standardize the interface to interprocessor
communication for the application. As has been demonstrated at AFIT, applications using
SPECTRUM have been ported to other architectures with little or no recoding of layers above the
node level interface. SPECTRUM claims compatibility with any synchronization protocol and does not
innately favor conservative versus optimistic techniques. In other words, SPECTRUM does not
implement a protocol; one must be developed separately. It does, however, provide a structure
upon which a protocol can be built. The synchronization protocol is implemented in the filter layer
of SPECTRUM. Well defined points of execution allow filter routines to be called and thereby
control the communication between LPs. As can be seen by the dependencies of the various
“layers” of SPECTRUM, the LP Manager, Node Manager, and Filter are tightly coupled. The
Filter layer is particularly complicated with its access to (and knowledge of) all layers but the
operating system. Overall, the application benefits from a defined, simplified interface to
SPECTRUM, and SPECTRUM benefits from a defined, localized interface to the operating system.

Table 2 SPECTRUM Standard Filter Descriptions

| Filter Name | Access Control |
|---|---|
| Initialization | Invoked as part of simulation startup, this fills an array with the addresses of filter functions for the six defined filters. |
| Get Event | Upon a request by the application for the next event to process |
| Post Event | Upon a request by the application to send a specified event to some remote destination |
| Post Message | Upon a Node Manager request to provide a remote event to the local LP Manager |
| Enqueue Event | Upon needing to place a specified event into the next event list |
| Advance Time | Upon needing to update the LP Manager clock |
| Termination | Upon conclusion of the simulation, usually to collect and print statistics |

The interaction of the SPECTRUM abstractions is important. Only LP Manager directly
calls filter routines, which are “registered” with the LP Manager as a part of the initialization of
the system. Those routines and invocation points are described in Table 2. As visible in Figure 10,
the basic discrete event simulator is broken between the application and SPECTRUM. LP Manager
controls the next event list and a clock reflecting the time of the last message processed. The
application performs the interpretation and consequence of events which it applies to its state
space. When the application requires an event to process, it requests it from the LP Manager. If a
Get Event filter is defined, LP Manager redirects the request to the Filter layer. The behavior of
the Get Event filter is protocol dependent. For a null message protocol, Get Event will block until
it is able to find an event in the next event list with a timestamp earlier than or equal to the safe
time. This may require waiting until a remote LP sends an event with the appropriate timestamp or
a remote LP sends a message that allows updating t_safe. Similar redirection of LP Manager calls to
the filter occurs for the other filter functions.

Figure 10 Redirection of LP Manager calls to Filter
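The blocking rule of a null message Get Event filter can be sketched as below. This is an illustrative model only, not SPECTRUM code: the function names, the heap-based event list, and the dictionary of input channel times are assumptions made for the example.

```python
import heapq


def safe_time(channel_times):
    """Safe time: the minimum timestamp over all input channels."""
    return min(channel_times.values())


def try_get_event(next_event_list, channel_times):
    """Pop the earliest event if its timestamp is at or before the safe time.

    Returns None when the LP would have to block, i.e. when only a new real
    or null message on an input channel can allow further progress.
    """
    if next_event_list and next_event_list[0][0] <= safe_time(channel_times):
        return heapq.heappop(next_event_list)
    return None


events = [(40, "b"), (15, "a")]
heapq.heapify(events)
channels = {"lp1": 20, "lp2": 45}

first = try_get_event(events, channels)       # 15 <= 20, so the event is safe
blocked = try_get_event(events, channels)     # 40 > 20, so the LP would block
channels["lp1"] = 50                          # e.g. a null message advances lp1
after_null = try_get_event(events, channels)  # safe time is now 45, so 40 is safe
```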


3.3 VSIM


VSIM is the application developed by Comeau and Breeden that simulates VHDL circuits.
It is actually an extension of a sequential VHDL simulator by Intermetrics. Consequently, the
basic interaction of the application, VSIM, and SPECTRUM is modified from Figure 10. Figure 11,
a VSIM-specific reproduction of the previous diagram, shows that VSIM is a simulator
unto itself and has its own event list and clock. VSIM, however, has devices to limit its behavior
evaluation to some subset of the whole circuit based on the LP number and the lpx.map
configuration file. Signal changes that affect behaviors mapped to other LPs result in VSIM forming
an event with the appropriate destination. VSIM sends the remote event via a call to LP Manager’s
Post_Event function upon retrieving it from the local event list at its designated time. All events
are inserted in the VSIM event list. Built around the conservative protocol, VSIM will request an
event from SPECTRUM if the time of its next event exceeds the time of the SPECTRUM clock.
SPECTRUM must provide a valid event or update the clock for VSIM to continue processing.
Thus, VSIM only invokes SPECTRUM to send or receive remote information (events or channel
times). Currently, the only access SPECTRUM has to the status of VSIM is get_low_time(),
which provides the time of the earliest event on the VSIM event list.

3.3.1 VHDLClocks94

VHDLCLOCKS is the null message filter implementation used by Kapp. VHDLClocks94
is a modified version that eliminates some inefficiencies and adds instrumentation in order to
gain insight into the activities of the processor. For this discussion, let:

t_in : Minimum time of all input channels = t_safe
t_SList : Minimum timestamp of event in SPECTRUM Event List
t_VList : Minimum timestamp of event in VSIM Event List
t_null : timestamp on null message = min[t_VList, t_in + delay]

Figure 12 captures the most complex synchronization of the new protocol. The Get Event filter
applies most policies of the protocol in its own code or by invocation of the other filters (not
delimited in the figure). The Post Event filter sends nulls down all channels not used by the message
being sent and is invoked upon finding a message to another LP in the VSIM event list. The
processor can be involved in one of three disjoint states: application processing, synchronization
processing, and blocking. Measuring the time spent in each of these states for all LPs in the
simulation may reveal pertinent or even limiting factors of runtime. VHDLClocks94 measures t_Get
from the invocation of the LP Manager’s Get_Event routine until it exits, including all time
consumed in the Get Event filter. The Get Event filter is the only portion of code that implements
blocking. t_Block begins at the demarcation indicated in Figure 12 and ends when the Get Event
filter exits (i.e. upon receiving an event that allows application processing to continue). t_Post is
the time spent in the Post Event filter. Since getting and posting events represent almost the
entirety of protocol overhead, the sum of t_Get and t_Post approximates the time spent in
synchronization processing and blocking. Since only one LP will be allocated to a processor, task
switching overhead can be ignored. Ignoring other operating system induced overheads, the
difference between observed runtime for an LP and its measured synchronization overhead is a good
estimate for the application processing time. Summary definitions for an LP are:

t_Get : wall time spent in the LP Manager Get Event subroutine

t_Post : wall time spent in the LP Manager Post Event subroutine

t_Block : wall time spent waiting for a message to allow continued processing

t_Sync : wall time spent processing synchronization protocol routines = t_Get + t_Post - t_Block

t_Proc : wall time spent processing the application = Runtime - (t_Get + t_Post)
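The summary definitions amount to simple arithmetic on the three measured quantities. The sketch below restates them; the function name and the sample timings are hypothetical.

```python
def lp_time_breakdown(runtime, t_get, t_post, t_block):
    """Split one LP's wall-clock runtime into the three disjoint states."""
    t_sync = t_get + t_post - t_block   # protocol work, excluding waiting
    t_proc = runtime - (t_get + t_post) # everything outside Get/Post Event
    return {"proc": t_proc, "sync": t_sync, "block": t_block}


# Hypothetical measurements in seconds for a single LP.
breakdown = lp_time_breakdown(runtime=30.0, t_get=8.0, t_post=2.0, t_block=5.0)
```

By construction, the three returned values sum to the observed runtime of the LP.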

VHDLClocks94 corrects an overflow error. When generating null messages, time stamps may be
calculated as the sum of the input channel time and the minimum delay to the specific output
channel. Since VSIM operates in picoseconds, a 2000 ns simulation concludes at timestamp
2,000,000,000. The maximum value for the timestamp datatype is 2,147,483,647. Input channels
plus delay times were occasionally exceeding this maximum and appearing as negative values.
This cause of overflow has been corrected, but the risk is still present in all time calculations when
large numbers are used with a restricted datatype.
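The overflow can be reproduced with a short sketch that assumes the timestamp datatype is a signed 32-bit integer. The saturating variant is only one possible guard and is not claimed to be the correction applied in VHDLClocks94; both function names are illustrative.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647


def wrapping_add(channel_time, delay):
    """Add as a signed 32-bit machine integer would, wrapping on overflow."""
    return ctypes.c_int32(channel_time + delay).value


def saturating_add(channel_time, delay):
    """Clamp the sum at the largest representable timestamp (non-negative inputs)."""
    return min(channel_time + delay, INT32_MAX)


wrapped = wrapping_add(2_000_000_000, 200_000_000)    # appears as a negative time
clamped = saturating_add(2_000_000_000, 200_000_000)  # stays at the maximum
```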

3.4  Graph  Partitioning  Tool  (GPT) 

Kapp developed the Graph Partitioning Tool v2.0 from an earlier version written by Major
Eric Christensen, USA. Kapp’s most significant improvement from the initial version was to add
the ability to perform cost model based partitioning. Structures and procedures were added to
allow various measurements of the graph and its allocation. These measurements can then be used
as terms in a cost model to ordinally compare two allocations. Kapp’s AB Improvement routine
uses iterative improvements of the cost model to find better allocations.
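Iterative improvement against a cost model can be illustrated with a generic hill-climbing sketch. This is not Kapp’s AB Improvement procedure; the single-vertex move rule, the toy cost function (cut arcs plus half the size imbalance), and all names are assumptions made for the example.

```python
def improve(allocation, num_lps, cost):
    """Move one vertex at a time to another LP whenever the cost model improves."""
    best = dict(allocation)
    best_cost = cost(best)
    improved = True
    while improved:
        improved = False
        for vertex in sorted(best):
            for lp in range(num_lps):
                if lp == best[vertex]:
                    continue
                trial = dict(best)
                trial[vertex] = lp
                trial_cost = cost(trial)
                if trial_cost < best_cost:  # accept strictly better allocations only
                    best, best_cost, improved = trial, trial_cost, True
    return best, best_cost


ARCS = [("a", "b"), ("b", "c"), ("c", "d")]


def toy_cost(alloc):
    """Toy cost model: arcs crossing LP boundaries plus half the size imbalance."""
    cut = sum(1 for u, v in ARCS if alloc[u] != alloc[v])
    sizes = [list(alloc.values()).count(lp) for lp in (0, 1)]
    return cut + 0.5 * abs(sizes[0] - sizes[1])


start = {"a": 0, "b": 1, "c": 0, "d": 1}
final, final_cost = improve(start, 2, toy_cost)
```

Such a search only orders the allocations it visits, so the result is a local minimum of the cost model rather than a guaranteed optimum.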

Unfortunately, the tool uses a partial cost model to search for iterative improvements. The
partial cost model is much more efficient to evaluate, allowing rapid execution of the search. The
primary goal of the new version of the Graph Partitioning Tool (version 3.0) is to localize the cost
model to a single Ada package and let the AB Annealing routines use the entire cost model in their
decisions.

3.4.1 Minimum Delay

Lookahead is ignored in the AB Annealing routines of GPT v2.0. This is not an arbitrary
oversight since the calculation of minimum distance for graph partitions is
$O(\mathit{inputarcs}_i \times n_i^2)$, where $n_i$ is the number of behaviors in LP_i and
$\mathit{inputarcs}_i$ is the number of input arcs for LP_i. The AB Annealing routine considers a
subset of the total number of vertices and must re-evaluate all LPs in the circuit for each
allocation, raising the cost to $O(\mathit{numLPs} \times \mathit{inputarcs}_i \times n_i^2)$. With this
complexity, it becomes difficult to amortize a week-long allocation analysis for a 30-second
simulation. While partitioning efficiency is not a goal of this research, reasonable performance is
required to progress. As an implementation note, delays are implemented by association with
communication channels. Reference to a minimum delay will be equivalently made in terms of the LP
pair or the appropriate adjacent channel.

3.4.1.1 Dijkstra

Dijkstra’s algorithm is the most efficient ($O(n^3)$) known for the task of finding the
minimum distance between all nodes of a graph with non-negative distances. The actual problem is to
find the minimum distance from the subset of input nodes to the subset of output nodes for each
LP. Since the graph of a circuit is static, one could find the minimum distances between all nodes
and then store this information. Subsequent partitioning analysis could then reference this archived
data instead of calculating it at runtime. For a one time cost, efficient use of minimum delay and
lookahead can be added to cost model partitioning. Following this reasoning, Dijkstra minimum
distance (delay) calculation is offered as a menu option in GPT v3.0.
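The minimum-delay calculation can be sketched with a heap-based form of Dijkstra’s algorithm. The adjacency-list layout, the node names, and the helper `channel_min_delay` are illustrative assumptions; GPT itself is written in Ada and its data structures are not shown in this text.

```python
import heapq


def min_delays(graph, source):
    """Shortest delay from source to every reachable node (non-negative delays)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for succ, delay in graph.get(node, []):
            candidate = d + delay
            if candidate < dist.get(succ, float("inf")):
                dist[succ] = candidate
                heapq.heappush(heap, (candidate, succ))
    return dist


def channel_min_delay(graph, inputs, outputs):
    """Smallest delay from any input node to any output node of an LP."""
    best = float("inf")
    for source in inputs:
        dist = min_delays(graph, source)
        for out in outputs:
            best = min(best, dist.get(out, float("inf")))
    return best


# Hypothetical LP: node -> list of (successor, gate delay).
lp_graph = {
    "in1": [("g1", 2), ("g2", 5)],
    "in2": [("g2", 1)],
    "g1": [("out", 4)],
    "g2": [("out", 1)],
}
from_in1 = min_delays(lp_graph, "in1")["out"]
lp_min = channel_min_delay(lp_graph, ["in1", "in2"], ["out"])
```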

3.4.1.2 Dynamic Search

Unfortunately, Dijkstra’s also carries an $O(n^2)$ space requirement. The 4 Mb of disk space
for the Wallace Tree becomes 64 Mb for the 4,243 behavior Associative Memory circuit.
Additionally, only knowing the minimum delay of an LP does not allow accurate calculation of
Fujimoto’s Lookahead Ratio. All paths through an LP contribute to the average delay of that LP. It is
the ratio of average delay to minimum delay that defines Lookahead Ratio. There is justification
for dynamically measuring all paths through an LP. Therefore, an alternate minimum delay
measurement device is offered as a menu option. Dynamic measurement is implemented as a depth first
search and may result in unreasonably long partitioning analysis. But, when feasible, it will give
better assessment of the delays in an LP.
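A depth first enumeration of path delays can be sketched as follows. The sketch assumes the subgraph inside an LP is acyclic and that output nodes terminate a path; the graph layout and function names are illustrative only.

```python
def path_delays(graph, node, outputs, so_far=0):
    """Yield the total delay of every path from node to an output node (DAG only)."""
    if node in outputs:
        yield so_far
        return
    for succ, delay in graph.get(node, []):
        yield from path_delays(graph, succ, outputs, so_far + delay)


def lookahead_ratio(graph, inputs, outputs):
    """Average path delay divided by minimum path delay for one LP."""
    delays = [d for source in inputs for d in path_delays(graph, source, outputs)]
    return (sum(delays) / len(delays)) / min(delays)


# Hypothetical LP with two input-to-output paths of delay 5 and 2.
lp_graph = {"a": [("b", 2), ("c", 1)], "b": [("d", 3)], "c": [("d", 1)]}
ratio = lookahead_ratio(lp_graph, ["a"], {"d"})
```

Because every path is visited, the running time grows with the number of paths, which is why this option can become unreasonably slow on large LPs.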

3.4.1.3 Worst Case Delay

As the circuits get very large, neither of the above options may be reasonable. As a
fail-safe option, the user can enter the minimum gate delay to be used as the distance through all LPs.
This guarantees the quickest partitioning analysis and correct simulation, but eliminates lookahead
as a configurable parameter. Very likely this technique of delay determination will inhibit
simulation performance due to the underestimated minimum delay.

3.4.1.4 Infinite Delay

An interesting situation is the occurrence of a source node in one LP and its descendants in
adjacent LPs. If no other paths make up such a communication channel, it is not obvious what the
delay of that channel should be. Sources have a delay of zero since they perform no processing.
However, a channel delay of zero could contribute to a violation of the null protocol if a zero cycle
formed between LPs. Based on an understanding of the behavior of the simulation, GPT v3.0
assigns this channel a delay of infinity (implemented as MAX INTEGER). Upon initialization, all
events of the testbench are scheduled. Consequently, all the events of a source are in the VSIM
event list of the host LP. Since null messages are timestamped min[t_VList, t_in + delay], as long as
events of the source are in the VSIM event list, t_VList will be the timestamp of the null messages
down the anomalous channel. The adjacent LP will not receive the infinite delay null message until
there are no more events in the VSIM event list. If the VSIM event list empties, an infinite delay
may be sent down the source’s channel because it will not be generating any more events.
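The effect of an infinite channel delay on null message timestamps can be sketched directly from the min[t_VList, t_in + delay] rule. The sketch uses `math.inf` as a stand-in for the MAX INTEGER value mentioned above and treats an empty VSIM event list as having an infinite earliest time; both choices, and the function name, are assumptions for illustration.

```python
import math


def null_timestamp(vsim_event_times, t_in, delay):
    """Timestamp of a null message: min[t_VList, t_in + delay]."""
    t_vlist = min(vsim_event_times) if vsim_event_times else math.inf
    return min(t_vlist, t_in + delay)


pending = null_timestamp([100, 250], t_in=40, delay=math.inf)  # source events remain
drained = null_timestamp([], t_in=40, delay=math.inf)          # event list is empty
ordinary = null_timestamp([100], t_in=40, delay=30)            # finite channel delay
```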

3.4.2  Graph  Parameters 

The  following  parameters  are  measured  by  the  GPT  v3.0  statistics  gathering  package: 


Table 3 Graph Partitioning Tool Measured Parameters

| Parameter | Symbol | Description |
|---|---|---|
| Number of Vertices | | Count of behaviors in problem graph |
| Number of Arcs | | Count of arcs in problem graph |
| Number of Components | LPs_Num | Number of LPs for which the circuit was allocated |
| Inter-LP Arcs | Arcs_NumInterLP | Count of arcs that cross LP boundaries |
| Channels | | Count of the channels between LPs; channels encapsulate arcs |
| Weight of Inter-LP Arcs | Arcs_WtInterLP | Assuming natural assignment of LPs to processors, a weight $w_{ij}$ can be assigned to communication between any two processors. If $a_{ij}$ is a directed arc that crosses the LP boundary from LP_i to LP_j, then Weight of Inter-LP Arcs = $\sum_i \sum_j a_{ij} \cdot w_{ij}$ |
| Average Weight of Inter-LP Arcs | Arcs_AvgWtInterLP | Weight of Inter-LP Arcs / Number of Components |
| Standard Deviation of Output Arcs Weight | Arcs_StdWtOut | Each LP has a set of output arcs which sums to a weight based on $w_{ij}$ values. Standard deviation is based on the output weight of each LP versus the average output weight for all LPs. |
| Maximum Deviation of Output Arcs Weight | Arcs_MaxWtOut | If OutWeight_j is the sum of the weights of output arcs for LP_j, then Maximum Deviation of Output Arcs Weight = $\max_j (\mathit{OutWeight}_j - \mathit{OutWeight}_{Avg})$ |
| Standard Deviation of Input Arcs Weight | Arcs_StdWtIn | Similar to output arc counterpart but for arcs that cross the LP boundary into LPs. This attempts to assess the balance of communication demands for both input and output messages. |
| Maximum Deviation of Input Arcs Weight | Arcs_MaxWtIn | Similar to output arc counterpart. Since runtime is based on max runtime of the LPs, it makes sense to track max parameters as well as average and standard deviation. |
| Average Lookahead Ratio | Lookahead_Avg | Fujimoto defines Lookahead Ratio as average message delay over minimum message delay. Only path lengths through the LP are available a priori to estimate average delays. Actual average will depend on event frequency down the various paths as well as event generation and termination within the LP. Note that LPs will have different delays for different pairs of input and output nodes. The overall minimum does not apply to output nodes not on that path. Thus, the minimum distances for all input-output node pairs in an LP are calculated, resulting in a single minimum path measure for each output channel. Lookahead Ratio is defined for a single LP. There is no aggregate measure for the whole simulation. Consequently, the aggregate measure used here is found by summing the average paths in all LPs and dividing by the sum of minimum paths through all LPs: Average Lookahead = $\dfrac{\sum_{\text{All } i \neq j} \left( \sum \mathit{delay}_{ij} \,/\, \#\text{ of paths from } LP_i \text{ to } LP_j \right)}{\sum_{\text{All } i \neq j} \min(\mathit{delay}_{ij})}$. Note that the different techniques for finding minimum delay will produce different values of Average Lookahead since they track different numbers of total paths. |
| Standard Deviation of Lookahead | Lookahead_Std | Lookahead Ratio is calculated for each LP and compared to the average over all LPs. |
| Maximum Deviation of Lookahead | Lookahead_Max | Same calculation technique as other maximum quantities. |
| Average Computational Load | Load_Avg | Each behavior $b_i$ is assigned a computational weight $cw_i$. Average load for all LPs is: $\dfrac{\sum_i b_i \cdot cw_i}{\text{Number of LPs}}$ |
| Maximum Deviation Computational Load | Load_Max | Same calculation technique as other maximum quantities. |
| Standard Deviation Computational Load | Load_Std | Each LP load compared to the overall average. |
| | t_Proc | Sum of time spent by each LP in application processing for an allocation |
| | t_Sync | Sum of time spent by each LP in synchronization processing for an allocation |
| | t_Block | Sum of time spent by each LP in blocking for an allocation |
| Maximum Runtime | t_MaxRun | Runtime taken by last LP to complete processing. |
| Speedup | Speedup | Time for biased sequential execution / time for parallel allocation execution |
| Nulls Sent | Nulls_Num | Total number of nulls actually sent by all LPs during a simulation |
| VList Nulls Posed | Nulls_VList | Total number of nulls posed by Get Event filter with the timestamp of the earliest event of the VSIM event list |
| Delay Nulls Posed | Nulls_Delay | Total number of nulls posed by Get Event filter with the timestamp of the earliest input channel plus the minimum delay |
| Total Nulls Posed | Nulls_Posed | Sum of Nulls_VList and Nulls_Delay. Incorrectly ignores nulls sent as part of Post Event filter. Nulls_Posed will exceed Nulls_Num since some nulls posed will not increase the channel time, so VHDLClocks94 will not send them. |
| Total Reals Sent | Reals_Num | Count of real messages sent by all LPs in a simulation. |
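Several of the arc-based parameters can be computed directly from an arc list and an allocation of behaviors to LPs. The sketch below is illustrative: it assumes channels are directed LP pairs and takes the inter-processor weight as a caller-supplied function, neither of which is specified in this text, and the names are not those of the GPT statistics package.

```python
def inter_lp_parameters(arcs, allocation, weight=lambda src_lp, dst_lp: 1):
    """Compute inter-LP arc statistics for one allocation.

    arcs: list of (source behavior, destination behavior)
    allocation: behavior -> LP number
    weight: assumed communication weight between two LPs (w_ij)
    """
    crossing = [(allocation[u], allocation[v]) for u, v in arcs
                if allocation[u] != allocation[v]]
    total_weight = sum(weight(i, j) for i, j in crossing)
    num_lps = len(set(allocation.values()))
    return {
        "arcs_num_inter_lp": len(crossing),       # arcs crossing LP boundaries
        "channels": len(set(crossing)),           # distinct directed LP pairs
        "arcs_wt_inter_lp": total_weight,         # sum of a_ij * w_ij
        "arcs_avg_wt_inter_lp": total_weight / num_lps,
    }


arcs = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("b", "d")]
allocation = {"a": 0, "b": 0, "c": 1, "d": 1}
stats = inter_lp_parameters(arcs, allocation, weight=lambda i, j: 2)
```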

3.5  Wallace  Tree 


The  majority  of  data  collection  and  model  formulation  is  based  on  the  same  Wallace  Tree 
Multiplier  used  by  Breeden  and  Kapp.  This  circuit  is  large  enough  to  permit  conclusions  while 
small  enough  to  wield  in  an  experimental  environment.  The  circuit  consists  of  1,050  behaviors  and 
1,770  arcs.  It  accepts  two  8  bit  input  vectors  to  form  their  12  bit  product. 


3.6  Associative  Memory 


The  associative  memory  circuit  developed  by  Kapp  consists  of 4,234  behaviors  and  9,312 
arcs  (Kapp,  1993:89).  It  is  the  largest  circuit  simulated  at  AFIT,  requiring  20-40  minutes  to 
simulate  on  the  iPSC/2.  Implementing  a  16x16  memory  array,  the  testbench  performs  several  read 
and  write  tests  including  pattern  searches.  Kapp  reports  that  design  conveniences  may  hinder 
simulation  performance.  A  common  enable  signal  causes  all  registers  to  schedule  events,  although 
only  one  may  be  of  interest  in  a  particular  cycle  (e.g.  read  versus  write  registers). 

3.7  Model  Structure 

The cost model proposed in this research was developed by collecting several hundred
samples of Wallace Tree statistics for the various partitioning strategies on 2 through 8 processors.
By using the AB Improvement routine, a representative spectrum of values for all parameters was
pursued. Blind use of three different cost models allowed data to be gathered while avoiding bias
to a particular region of the search space.


Runtime performance is a function of many disparate parameters and their
interrelationships. In order to add theoretic structure to the model development, an abstract
mathematical description is now developed. Since processors are in one of three states at all times,
the sum of time spent for each state should give the runtime of that processor. While the critical
runtime is the maximum over all LPs, the sum of all states over all LPs is guaranteed to include the
performance of the slowest. Also, if LPs are generally balanced, the sum over all LPs will be a good
indicator for the performance of the slowest LP. Rather than attempt to build a cost model for
runtime directly, cost models were built for the component states: application processing,
synchronization processing, and blocking. Upon determining the component cost models, the terms of
each were used to construct the overall model. These subordinate cost models provide accuracy and
reliability data for each component, allowing a more insightful critique of the composite cost
model. All cost models were developed in accordance with the procedure outlined in Section 2.7.

$$t_{runtime} = b_0 + b_1 \sum_{\text{All LPs}} t_{Proc} + b_2 \sum_{\text{All LPs}} t_{Sync} + b_3 \sum_{\text{All LPs}} t_{Block}$$

Equation 8 Top Level Cost Model

3.8 Summary

This chapter outlined the operating environment and specific test cases used in this
research. The three major software systems supporting this project are: SPECTRUM, VSIM, and
the Graph Partitioning Tool v3.0. The specific software used for synchronization management is
the null message filter VHDLClocks94, which is instrumented to measure the time spent by the
processor in each of three states: application processing, synchronization processing, and
blocking. Only a summary description of the support software behavior is presented here. Each system
has several thousand lines of code and demands an in-depth study to be fully understood.

By predicting the time spent in the three component states, a cost model is formed to
predict overall runtime. The statistical methods used exactly follow those presented by Thorndike and
are outlined in Section 2.7 of this paper. The analysis is primarily performed on the Wallace Tree
Multiplier of previous AFIT research. The cost model is then applied to Kapp’s Associative
Memory circuit to assess robustness.

4. Data Description and Analysis


4.1  Model  Development 


This section presents summary forms of the sample data collected and the development of
the cost model. Three hundred fifty eight simulations for the Wallace Tree Multiplier comprised
the entire sample set. Each data tuple was randomly assigned into one of two sets, Set A or Set B.
All cost models were developed from each dataset independently and then compared in accordance
with double cross-validation. Upon validation of the terms, each cost model was redeveloped over
the entire sample set. To avoid ambiguity between the various cost models under development, the
following names apply to this discussion:

Cost_Proc : predicts total time of all LPs spent in application processing

Cost_Sync : predicts total time of all LPs spent in synchronization processing

Cost_Block : predicts total time of all LPs spent in blocking

Cost_Runtime : the ultimate model of this research, predicts the maximum LP runtime

built from the assumed model that

$$Cost_{Runtime} = b_0 + b_1 \, Cost_{Proc} + b_2 \, Cost_{Sync} + b_3 \, Cost_{Block}$$

Equation 9 Cost_Runtime Top Level Model

All cost models use a priori parameter values for a specific allocation as input.


4.1.1 Top Level Model

Before forming cost models for the component terms, the top level model should be
validated. A linear regression of Equation 8 over the entire sample population results in the
following coefficient vector:

$$\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} -2.6688 \\ 0.81 \\ 0.091 \\ 0.0907 \end{bmatrix}$$

Thus, measured t_Proc, t_Sync, and t_Block can be used to form a Cost_Runtime linear model with 0.978
correlation and average error of prediction of 1.006 seconds. The correlation seems solid and the
absolute error measure is within 4% of the average runtime. Therefore, if a priori parameters can
be found to predict t_Proc, t_Sync, and t_Block, the component cost models can be accumulated to
predict t_MaxRun.
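Applying the fitted top level model is a single linear combination. The sketch below assumes the coefficients are ordered as intercept, application processing, synchronization processing, and blocking, following the term order of Equation 8; the function name and the input values are hypothetical and are not measured data.

```python
# Coefficients from the regression above, assumed ordered as in Equation 8.
B0, B1, B2, B3 = -2.6688, 0.81, 0.091, 0.0907


def predict_max_runtime(sum_t_proc, sum_t_sync, sum_t_block):
    """Predicted maximum LP runtime from state times summed over all LPs (seconds)."""
    return B0 + B1 * sum_t_proc + B2 * sum_t_sync + B3 * sum_t_block


# Hypothetical totals over all LPs, in seconds.
predicted = predict_max_runtime(10.0, 20.0, 30.0)
```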


4.1.2 Application Processing Model

The initial step of model development is identifying a candidate set of terms (parameters)
pertinent to the criterion term which is to be predicted. An initial set is formed from the parameters
that demonstrate significant correlation to the criterion. Appendix B presents multiple correlation
data for all parameters in the sample set. From this, Arcs_NumInterLP, Arcs_StdWtOut, Arcs_StdWtIn, and
Lookahead_Avg are added to the candidate set. The Cost_Proc model offers the greatest potential for
prediction accuracy since t_Proc is defined to be independent of the complex synchronization
protocol. It is logical that load and load balance will influence the time spent in application
processing. Consequently, Load_Max, Load_Avg, and Load_Std are added to the candidate set. An LP’s
application processing time should be fundamentally dependent on the number of events it has to
process. A crude complexity analysis can be accomplished by calling E the set of all events that
will be processed in a simulation. Note that E is constant over all allocations. In order to
accomplish E, VSIM must enqueue and process all events over the course of the simulation. The
implementation of the next event queue therefore has direct relevance to the performance of the
simulation. Sorted insertion into the next event queue is $O(N)$ in the worst case, where $N = |E|$.
Consequently, the entire simulation performs $O(N^2)$ of application processing. Assuming uniform
gate activity and balanced loads on each partition, the amount of application processing for each
LP should be $O(N^2/p^2)$. Note that this implies that application processing can achieve superlinear
speedup since $O(N^2/p^2)$ is less than $O(N^2/p)$. However, this was an extremely crude analysis which
should not be taken too literally. Its purpose was to identify additional terms to add to the
candidate set. Combinations of the 1/p term are included to complete the candidate set of terms for
Cost_Proc. The candidate set is:

{ ArcsStdWtOut, ArcsStdWtIn, LookaheadAvg, LoadMax, LoadStd, 1/[LPsNum log(LPsNum)],
1/LPsNum, log(LPsNum), 1/LPsNum^2 }.

Note that LoadAvg was dropped from the candidate set. Since VerticesNum is constant between
allocations, VerticesNum/LPsNum offers no more information than just 1/LPsNum. The corresponding
coefficient of the linear model will incorporate the VerticesNum constant.
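
The crude complexity argument above can be checked numerically. The following Python sketch is an
illustration only: it assumes the next event queue is a sorted list whose worst-case insertion
scans every event already queued, and the function names and the event count N = 1200 are
hypothetical choices, not values taken from VSIM or the sample set.

```python
def worst_case_insert_work(num_events):
    """Entries scanned when each new event lands behind all earlier ones."""
    # The j-th insertion scans the j - 1 events already queued: O(N^2) overall.
    return sum(range(num_events))


def per_lp_work(total_events, num_lps):
    """Work for one LP, assuming uniform activity and perfectly balanced loads."""
    return worst_case_insert_work(total_events // num_lps)


N = 1200  # hypothetical |E|, chosen to divide evenly by 2, 4 and 8
sequential = worst_case_insert_work(N)
speedups = {p: sequential / per_lp_work(N, p) for p in (2, 4, 8)}

for p, s in speedups.items():
    print(f"p = {p}: queue-work speedup = {s:.2f} (linear speedup would be {p})")
```

Under these assumptions the ratio grows roughly as p squared, which is the superlinear effect the
analysis points to; as the text cautions, this should not be read as a runtime prediction.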

The  mathematical  software  used  for  cost  model  development  is  MathCad  5.0+  for 
Microsoft  Windows.  Datasets  are  arrays  where  each  row  is  a  single  sample  instance  and  each 
column  is  a  specific  parameter.  A  (row, column)  pair  identifies  the  value  of  a  single  parameter  for 
a single simulation run. For example, SetA_{20,LoadMax} is the LoadMax value for the twentieth
sample of Set A. Since not all parameters are considered during model construction, a subset array
X is the data array of the candidate set. Let x be the array including a specific subset of
candidates. Let y be the sample vector of the criterion parameter. The development of CostProc
follows:

Initialize candidate data array:

$$i := 0 \,..\, rows(SetA) - 1 \qquad \lg(x) := \frac{\log(x)}{\log(2)}$$

$$X_{i,0} := 1 \qquad X_{i,1} := \frac{1}{SetA_{i,LPsNum}} \qquad X_{i,2} := \left(\frac{1}{SetA_{i,LPsNum}}\right)^2$$

$$X_{i,3} := SetA_{i,LoadStd} \qquad X_{i,4} := SetA_{i,LoadMax} \qquad X_{i,5} := \frac{1}{SetA_{i,LPsNum} \cdot \lg(SetA_{i,LPsNum})}$$

$$X_{i,6} := SetA_{i,ArcsWtInterLP} \qquad X_{i,7} := SetA_{i,ArcsStdWtIn} \qquad X_{i,8} := SetA_{i,ArcsStdWtOut}$$

$$X_{i,9} := SetA_{i,LookaheadAvg} \qquad X_{i,10} := X_{i,1} \cdot X_{i,5}$$

$$y_i := SetA_{i,tProc}$$


Each row of X has candidate parameter values for a single simulation. Each column of X has the
values over all SetA runs of a specific candidate parameter. The first column of X is initialized
to a vector of 1s to keep β and X of compatible dimensions.

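Choose the first parameter to add to the model:

$$r^T = \begin{bmatrix} 0.972 & 0.98 & -0.444 & -0.465 & 0.951 & -0.211 & 0.586 & \cdot & \cdot & -0.973 \end{bmatrix}$$

(Entries correspond to columns 1 through 10 of X; the entries for columns 8 and 9 are not legible
in the source.)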

Vector r is the multiple correlation vector for each X column and the y vector. The second column
of X carries the greatest correlation to the criterion parameter so the initial linear regression
equation is

$$Cost_{Proc} = \beta_0 + \beta_1 \cdot X_2 \quad \text{or} \quad Cost_{Proc} = \beta_0 + \beta_1 \cdot \frac{1}{LPs_{Num}^2}$$

Matrix  x  is  used  to  collect  the  selected  columns  of  X  into  a  single  matrix. 

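$$x_{i,0} := 1 \qquad x_{i,1} := X_{i,2}$$

$$\beta := (x^T x)^{-1} x^T y \qquad \beta = \begin{bmatrix} 12.312 \\ 50.264 \end{bmatrix}$$

$$p := cols(x) \quad p = 2 \qquad k := p - 1 \quad k = 1 \qquad n := rows(x) \quad n = 179$$

$$SSr := \beta^T x^T y - \frac{\left(\sum_{i=0}^{n-1} y_i\right)^2}{n} \qquad SSe := y^T y - \beta^T x^T y$$

$$MSe := \frac{SSe}{n - k - 1} \qquad F := \frac{SSr}{MSe}$$

$$F = 4.196 \times 10^3 \qquad \sqrt{MSe} = 1.313 \qquad SSr02 := SSr$$

$$F_{stat} = F_{\alpha,k,n-p} = 2.71$$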


The F statistic is used to verify that the model is not just coincidentally related to the
criterion. Using an α of 0.1, the calculated F must be greater than Fstat to reject the null
hypothesis. The large number of samples makes it likely that any relationship observed actually
exists. MSe is the mean squared residual error. Thus, on average, the model's predictions will
deviate from observed criterion values by √MSe due to error not accounted for in the regression.
SSr02 saves the amount of variation accounted for in the regression for later marginal comparison
with a larger model. To select the next parameter to add to the model, each remaining candidate
will be included with the current model, individually.
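
The single-term regression and its F statistic can be sketched in a few lines of Python. This is
a minimal illustration, not the MathCad worksheet: the data below are synthetic stand-ins (a
made-up tProc built from 1/LPsNum^2 plus a fixed perturbation), and the function name
`simple_regression` is hypothetical. For one regression term the closed-form slope and intercept
give the same β as the matrix form above.

```python
def simple_regression(x, y):
    """Fit y = b0 + b1*x by least squares; return (b0, b1, F, sqrt(MSe))."""
    n = len(y)
    k = 1                                   # one regression term besides the intercept
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_r = b1 * sxy                         # variation explained by the regression
    ss_e = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ms_e = ss_e / (n - k - 1)               # mean squared residual error
    return b0, b1, (ss_r / k) / ms_e, ms_e ** 0.5


# Synthetic stand-in data (NOT the thesis samples).
lps = [2, 3, 4, 5, 6, 7, 8]
x = [1 / p ** 2 for p in lps]
noise = [0.05, -0.05, 0.05, -0.05, 0.05, -0.05, 0.05]
y = [12 + 50 * xi + e for xi, e in zip(x, noise)]

b0, b1, f_stat, root_mse = simple_regression(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, F = {f_stat:.1f}, sqrt(MSe) = {root_mse:.3f}")
```

Repeating this fit once per remaining candidate column and keeping the column with the largest F
is the selection step described above.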

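Add additional terms to the model:

$$x := 0$$

$$x_{i,0} := 1 \qquad x_{i,1} := X_{i,2} \qquad x_{i,2} := X_{i,6}$$

$$\beta := (x^T x)^{-1} x^T y$$

$$p := cols(x) \quad p = 3 \qquad k := p - 1 \quad k = 2 \qquad n := rows(x) \quad n = 179$$

$$SSr := \beta^T x^T y - \frac{\left(\sum_{i=0}^{n-1} y_i\right)^2}{n} \qquad F := \frac{SSr}{MSe} \qquad F_{stat} = F_{\alpha,k,n-p} = 2.30$$

$$F = 6.23 \times 10^3 \qquad \sqrt{MSe} = 0.773 \qquad SSr026 := SSr$$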

In the above calculations, testing each remaining candidate as another term in the model resulted
in the F statistics in Table 4. Note that X2 was not considered for addition again. Since X6
resulted in the highest F statistic, it is selected for addition to the model.

Candidate Parameter    Resulting F statistic
X1                     2188
X3                     2463
X4                     2486
X5                     2213
X6                     6230
X7                     2395
X8                     2772
X9                     3771
X10                    2143

Table 4  F Statistic Comparison for Parameter Selection

The addition of a new parameter to the model raises the question of the marginal contribution of
that parameter given those already present. Additionally, does the addition of this new term
negate the necessity for other terms already in the model? Using the partial F test of Section
2.7, each term is removed from the model in turn. The marginal decrease in SSr is tested for
significance as follows:


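$$x_{i,0} := 1 \qquad x_{i,1} := X_{i,6}$$

$$p := cols(x) \quad p = 2 \qquad k := p - 1 \quad k = 1 \qquad n := rows(x) \quad n = 179$$

$$\beta = \begin{bmatrix} 26.762 \\ -0.004 \end{bmatrix}$$

$$SSr := \beta^T x^T y - \frac{\left(\sum_{i=0}^{n-1} y_i\right)^2}{n} \qquad SSe := y^T y - \beta^T x^T y \qquad MSe := \frac{SSe}{n - k - 1}$$

$$Fp := \frac{SSr026 - SSr}{MSe} \qquad F_{stat} = F_{\alpha,k,n-p} = 2.71$$

$$Fp = 174.419$$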
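
The partial F computation can be sketched in plain Python as follows. This is an illustration
under stated assumptions rather than a transcription of the worksheet: the data are synthetic,
the helper names `solve` and `fit` are hypothetical, and the sketch divides by the MSe of the full
(two-term) model, which is the common textbook form of the test.

```python
def solve(a, b):
    """Solve a*z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [mr - f * mc for mr, mc in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(cols, y):
    """Least squares with an intercept column; returns (SSr, MSe)."""
    n = len(y)
    x = [[1.0] + [col[i] for col in cols] for i in range(n)]
    p = len(x[0])
    xtx = [[sum(x[i][r] * x[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
    xty = [sum(x[i][r] * y[i] for i in range(n)) for r in range(p)]
    beta = solve(xtx, xty)                            # beta = (x^T x)^-1 x^T y
    bxy = sum(b * v for b, v in zip(beta, xty))       # beta^T x^T y
    ss_r = bxy - sum(y) ** 2 / n
    ss_e = sum(v * v for v in y) - bxy
    return ss_r, ss_e / (n - p)                       # n - p equals n - k - 1


# Synthetic stand-in data (NOT the thesis samples).
x1 = [1 / p ** 2 for p in (2, 3, 4, 5, 6, 7, 8)]
x2 = [3, 1, 4, 1, 5, 9, 2]
noise = [0.05, -0.05, 0.05, -0.05, 0.05, -0.05, 0.05]
y = [10 + 40 * a + 0.5 * b + e for a, b, e in zip(x1, x2, noise)]

ss_r_full, ms_e_full = fit([x1, x2], y)
ss_r_reduced, _ = fit([x2], y)            # the model with the x1 term removed
f_partial = (ss_r_full - ss_r_reduced) / ms_e_full
print(f"partial F for x1 = {f_partial:.1f}")
```

A partial F well above the critical value indicates that the removed term carries information the
remaining terms do not, so it should stay in the model.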


As seen in the Fp statistic, the SSr of the model with X2 and X6 is significantly greater than
that of the model using X6 alone. Therefore, both terms should stay.

The model continues to be developed by using the F statistic to select from the remaining
candidate set and then testing the significance of each term in the model. Soon, additional terms
do little to improve the accuracy of the model, at which point the process is concluded. For the
SetA development of CostProc, the model is:

$$Cost_{Proc} = \beta_0 + \beta_1 \left(\frac{1}{LPs_{Num}^2}\right) + \beta_2 (Arcs_{NumInterLP}) + \beta_3 (Arcs_{StdWtIn}) + \beta_4 \left(\frac{1}{LPs_{Num}}\right) + \beta_5 (Arcs_{StdWtOut})$$

Equation 10  CostProc

$$\beta = \begin{bmatrix} 10.185 & 37.734 & 0.004 & 0.041 & 21.456 & 0.005 \end{bmatrix}^T \qquad corr(Cost_{Proc}, y) = 0.995 \qquad \sqrt{MSe} = 0.633$$

As can be seen, CostProc for SetA is very successful in both correlation and absolute error. The
identical form of the model was developed over SetB with the following coefficients and summary
statistics:

$$\beta = \begin{bmatrix} 10.135 & 39.92 & 0.004 & 0.039 & 15.203 & 0.011 \end{bmatrix}^T \qquad corr(Cost_{Proc}, y) = 0.995 \qquad \sqrt{MSe} = 0.656$$

Double cross-validation requires that the β for SetA be used for SetB and vice versa. Upon
performing this,

        SetA                          SetB
βa      Corr = 0.995, √MSe = 0.633    Corr = 0.995, √MSe = 1.895
βb      Corr = 0.995, √MSe = 1.632    Corr = 0.995, √MSe = 0.656

Although the absolute error increased in cross-validation, both the form and the correlation of
the model applied well. The model is validated. Regressing this model on the entire sample set
yields:

$$\beta = \begin{bmatrix} 10.109 & 38.75 & 0.004 & 0.041 & 18.267 & 0.008 \end{bmatrix}^T \qquad corr(Cost_{Proc}, y) = 0.995 \qquad \sqrt{MSe} = 0.651$$
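
The double cross-validation step can be illustrated with a one-term model. The sketch below is
hypothetical throughout: the two sample sets are synthetic, the function names are invented for
the example, and the root mean squared error is taken over n rather than n - k - 1 for
simplicity. It fits β on each set and evaluates both coefficient vectors on both sets.

```python
def fit_line(x, y):
    """Least-squares (b0, b1) for y = b0 + b1*x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1


def predict(beta, x):
    return [beta[0] + beta[1] * a for a in x]


def rmse(beta, x, y):
    return (sum((b - p) ** 2 for b, p in zip(y, predict(beta, x))) / len(y)) ** 0.5


def corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return suv / (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5


# Synthetic stand-in sample sets (NOT SetA / SetB from the thesis).
x = [1 / p ** 2 for p in (2, 3, 4, 5, 6, 7, 8)]
sets = {
    "A": (x, [12 + 50 * a + e for a, e in zip(x, [0.05, -0.05, 0.05, -0.05, 0.05, -0.05, 0.05])]),
    "B": (x, [12 + 50 * a + e for a, e in zip(x, [-0.08, 0.02, 0.06, -0.04, 0.03, -0.01, 0.02])]),
}
betas = {name: fit_line(*data) for name, data in sets.items()}

results = {}
for fit_name, beta in betas.items():
    for eval_name, (xs, ys) in sets.items():
        results[(fit_name, eval_name)] = (corr(predict(beta, xs), ys), rmse(beta, xs, ys))
        print(fit_name, "on", eval_name, results[(fit_name, eval_name)])
```

By construction of least squares, coefficients fitted on one set can never beat that set's own
coefficients on absolute error, so an increase in error under cross-validation is expected; the
question the tables above answer is how large the increase is.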


4.1.3  Synchronization  Processing  Model 

Synchronization costs present a much more formidable challenge in identifying candidate
parameters. Correlations are weaker and fewer than with tProc. Since tSync is a function of many
parameters, it becomes difficult to interpret two-dimensional projections of the k-space within
which the tSync criterion is varying. Figure 13 presents a repeating projection pattern for both
tSync and tBlock. Although the relationship is essentially linear, two disparate lines demonstrate
the influence of other parameters. Some other parameter is discretely (not continuously) affecting
the slope of the tSync versus parameter relationship. By sorting the sample set on the independent
variable (parameter x) and then observing consistent changes in other parameters of the dataset,
the terms varying with slope can be identified. In the case of LPsNum the companion parameter is
ArcsNumInterLP (see Figure 14). By using this technique, the final form of the CostSync model
performs very well over the sample set.


Figure 13  tSync versus LPsNum (x-axis: DataSet_{i,LPsNum}, 2 through 8)


$$Cost_{Sync} = \beta_0 + \beta_1 (LPs_{Num} \cdot Arcs_{NumInterLP}) + \beta_2 (LPs_{Num}^2) + \beta_3 (Arcs_{StdWtOut}) + \beta_4 (Lookahead_{Avg}) + \beta_5 (LPs_{Num})$$

Equation 11  CostSync


For SetA

$$\beta = \begin{bmatrix} 16.386 & 0.005 & 0.453 & -0.094 & -0.51 & -1.77 \end{bmatrix}^T$$

For SetB

$$\beta = \begin{bmatrix} 12.001 & 0.006 & 0.489 & -0.06 & -0.285 & -1.801 \end{bmatrix}^T$$

Cross-validation

        SetA                          SetB
βa      Corr = 0.994, √MSe = 2.107    Corr = 0.987, √MSe = 7.657
βb      Corr = 0.993, √MSe = 7.773    Corr = 0.99,  √MSe = 2.305

While not as accurate as CostProc, CostSync shows good correlation and absolute error within 9% of
the average tSync. Cross-validation kept good correlation, but absolute error jumped to 30% of the
average tSync. This suggests that the ordinal scaling of CostSync is correct, but its ability to
retain interval information is poor.

For the whole sample set:

$$\beta = \begin{bmatrix} 13.787 & 0.005 & 0.462 & -0.078 & -0.38 & -1.639 \end{bmatrix}^T \qquad corr(Cost_{Sync}, y) = 0.992 \qquad \sqrt{MSe} = 2.244$$


4.1.4  Blocking Model

Blocking proves to be the most difficult of the three components to model. Difficulties similar
to those of CostSync are encountered, but where the CostSync issues were resolved, CostBlock
remains a problem. The combining technique of the previous section failed to collapse the
disparate linear relationships into one. Additionally, tBlock varies over a much larger range,
resulting in a similarly large discrepancy between variance due to regression terms and total
variance. Furthermore, the models formed on SetA and SetB failed to include the same terms.
Without insight into alternative parameter combinations, both models were cross-validated and the
best chosen for regression over the whole sample set. Note that the chosen model failed double
cross-validation, but is the best one available.

$$Cost_{Block} = \beta_0 + \beta_1 (LPs_{Num}^2) + \beta_2 (LPs_{Num} \cdot Arcs_{NumInterLP}) + \beta_3 (Lookahead_{Avg}) + \beta_4 \left(e^{1/Load_{Avg}}\right) + \beta_5 (LPs_{Num})$$

Equation 12  CostBlock

For SetA

$$\beta = \begin{bmatrix} 4.758 \times 10^6 & 5.503 & 0.007 & -2.65 & -4.758 \times 10^6 & 4.52 \times 10^3 \end{bmatrix}^T$$

For SetB

$$\beta = \begin{bmatrix} 2.873 \times 10^6 & 4.629 & 0.002 & -1.294 & -2.873 \times 10^6 & 2.731 \times 10^3 \end{bmatrix}^T$$

Cross-validation

        SetA                           SetB
βa      Corr = 0.987, √MSe = 11.186    Corr = 0.978, √MSe = 75.604
βb      Corr = 0.984, √MSe = 53.353    Corr = 0.98,  √MSe = 12.767

For whole sample set

$$\beta = \begin{bmatrix} 4.307 \times 10^6 & 5.257 & 0.005 & -2.107 & -4.307 \times 10^6 & 4.093 \times 10^3 \end{bmatrix}^T \qquad corr(Cost_{Block}, y) = 0.983 \qquad \sqrt{MSe} = 12.229$$


While the correlation remains impressive, the average error of prediction is now 15% of the
average tBlock. In cross-validation that error jumps to over 90%. This model is seemingly
unreliable with regard to interval scaling. Hopefully, that phenomenon will dissolve in the
composite model for runtime.


4.1.5  Composite  Model 


The composite model development for CostRuntime places all the terms of the component models into
the candidate set. The candidate matrix is initialized by:

$$X_{i,0} := 1 \qquad X_{i,1} := DataSet_{i,LPsNum} \qquad X_{i,2} := (DataSet_{i,LPsNum})^2$$

$$X_{i,3} := \frac{1}{DataSet_{i,LPsNum}} \qquad X_{i,4} := DataSet_{i,ArcsStdWtOut} \qquad X_{i,5} := e^{1/DataSet_{i,LoadAvg}}$$

$$X_{i,6} := DataSet_{i,LookaheadAvg} \qquad X_{i,7} := \left(\frac{1}{DataSet_{i,LPsNum}}\right)^2 \qquad X_{i,8} := DataSet_{i,LPsNum} \cdot DataSet_{i,ArcsNumInterLP}$$

$$X_{i,9} := DataSet_{i,ArcsWtInterLP} \qquad X_{i,10} := DataSet_{i,ArcsStdWtIn}$$

The correlation vector of these parameters with observed tMaxRun is

$$r^T = \begin{bmatrix} \cdot & 0.60\ldots & -0.088 & -0.446 & 0.508 & 0.081 & -0.217 & 0.67 & 0.475 & -0.375 \end{bmatrix}$$

(The first entry and the final digit of the second entry are not legible in the source.)

As can be seen, the correlations are not nearly as strong as those observed for CostProc. The
model development, however, is surprisingly well-behaved. Both SetA and SetB select identical
terms for the model.

$$Cost_{Runtime} = \beta_0 + \beta_1 (LPs_{Num}) + \beta_2 \left(\frac{1}{LPs_{Num}^2}\right) + \beta_3 (LPs_{Num} \cdot Arcs_{NumInterLP}) + \beta_4 (Lookahead_{Avg}) + \beta_5 (Arcs_{StdWtIn})$$

Equation 13  CostRuntime

For SetA

$$\beta = \begin{bmatrix} 7.908 & 0.251 & 52.273 & 0.002 & -0.39 & -0.013 \end{bmatrix}^T$$

For SetB

$$\beta = \begin{bmatrix} 7.701 & 0.264 & 47.171 & 0.001 & -0.236 & -0.002 \end{bmatrix}^T$$

Cross-validation

        SetA                          SetB
βa      Corr = 0.935, √MSe = 1.88     Corr = 0.868, √MSe = 2.962
βb      Corr = 0.919, √MSe = 5.926    Corr = 0.872, √MSe = 2.195

For whole sample set

$$\beta = \begin{bmatrix} 7.908 & 0.251 & 52.273 & 0.002 & -0.39 & -0.013 \end{bmatrix}^T \qquad corr(Cost_{Runtime}, y) = 0.935 \qquad \sqrt{MSe} = 1.88$$

4.1.6  Minimum Delay Revisited


Unfortunately, the CostRuntime model raises an issue. The minimum delay has not been resolved for
very large circuits like the Associative Memory. Dijkstra's algorithm would require a 64 Mb data
structure on disk, while the dynamic search performs too slowly for iterative improvement of a
4,000 node circuit. It would have been convenient if the cost model allowed Lookahead to be
ignored, which would remove the need to find the minimum delay. Instead, the issue must be forced
by redeveloping CostRuntime without LookaheadAvg as a candidate. Let this cost model be labeled
CostxRuntime; it is

$$Cost_{xRuntime} = \beta_0 + \beta_1 (LPs_{Num}) + \beta_2 \left(\frac{1}{LPs_{Num}^2}\right) + \beta_3 (LPs_{Num} \cdot Arcs_{NumInterLP})$$

Equation 14  CostxRuntime

For the whole sample set

$$\beta = \begin{bmatrix} 5.852 & 0.298 & 44.196 & 0.001 \end{bmatrix}^T \qquad corr(Cost_{xRuntime}, y) = 0.894 \qquad \sqrt{MSe} = 2.176$$


While the correlation has diminished from the component cost models, it still appears well
correlated and average error is less than 8% of the average runtime. Except for the limited degrees
of freedom, there is no reason to doubt this model. Note that the only pertinent parameters are
manipulations of LPsNum and ArcsNumInterLP. There is a linear term that reflects the increase in
communication as LPsNum increases as well as an inverted square term that decreases application
processing time as LPsNum increases. The few unique parameters legitimize their importance in
accelerating VHDL simulations, but imply the model is making assumptions regarding other
parameters. LoadAvg is an obvious influence on runtime; however, when developing over a single
circuit, the term is redundant to 1/LPsNum since VerticesNum is constant. Since no load balancing
term remained in the cost model, it is likely the sample set failed to expose enough variance in
LoadStd to introduce it into the model. The following section presents speedup curves for the
various types of partitioning.


[Speedup plot; y-axis: Biased Speedup; legend: Linear, Breeden Random, SBF, Kapp SBF Annealing,
CostRuntime, CostxRuntime]


4.2  Model  Verification 


Two hundred eight additional runs comprise the validation set. All cost models were evaluated on
their ability to improve initial partitions via AB Improvement. Figure 15 shows the results of the
AB Annealing routine using the CostRuntime and CostxRuntime models for simulation of the Wallace
Tree Multiplier. Performance is disappointing and difficult to interpret. Foremost, both
statistical models show conflicting results in contrast with the Kapp model. A margin is given to
the Kapp model in that instrumentation of the three processing states adds about a 10% performance
overhead. This overhead would not be realized in sequential execution since the filter is not
called and consequently diminishes speedup. However, both models developed in this thesis
demonstrate erratic speedup patterns with isolated points of extremely good or extremely bad
performance relative to the preceding curve. Inspection of the data reveals that continued
iteration using either cost model may result in performance degradation. CostRuntime frequently
lessens speedup over the SBF partition on which it was based.

Figure 16  Sum of LP Activity versus Number of LPs

The next step involves identifying the sources of error for the cost models. Figure 16
demonstrates the behavior of the validation set as the number of LPs increases. Note that
cumulative processing time decreases. In other words, partitioning the simulation requires less
total work to be accomplished and provides the opportunity for superlinear speedup. Of course,
partitioning the sequential implementation would also realize this reduction in overall work.
Blocking seems to dominate LP performance beginning with 4 LP partitioning. Next, each component
cost model is compared to appropriate values of the validation set. CostProc maintains a
correlation of 0.993 with tProc (see Figure 17, CostProc versus tProc). CostSync and CostBlock
report similarly strong correlation with their observed counterparts. Correlation for CostSync and
tSync is 0.951 and correlation for CostBlock and tBlock is 0.944. Despite this prowess at
predicting time spent in the component states, CostRuntime and CostxRuntime deliver correlations
of 0.722 and 0.727 respectively. Somehow the sum of the parts is not equaling the whole. An
interesting phenomenon occurs when comparing the top-level component model (Equation 8) using
observed components and then calculated components (Equation 9). When using observed data of the
validation set, Equation 8 exhibits a correlation of 0.954 with tMaxRun. However, when using the
component cost models, Equation 9 demonstrates a correlation of 0.765 to tMaxRun. Correlation
($\succ$) fails to survive through the linear model! Put in symbolic terms:

$$a \succ a' \qquad b \succ b' \qquad c \succ c'$$

$$f(a', b', c') \succ Y \quad \text{but} \quad f(a, b, c) \nsucc Y$$
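
That correlation need not carry through a chain of well-correlated quantities can be shown with a
small constructed example. The Python sketch below is purely illustrative: the four-sample vectors
are hand-built (they are not thesis data) so that a predicted component correlates 0.95 with its
observed counterpart, and the observed component correlates 0.95 with the runtime, yet the
predicted component correlates only 0.805 with the runtime.

```python
import math


def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


# Two orthogonal, zero-mean directions in sample space (constructed, not measured).
e1 = [1.0, -1.0, 0.0, 0.0]
e2 = [0.0, 0.0, 1.0, -1.0]

c = 0.95                          # target correlation for each link of the chain
s = math.sqrt(1 - c * c)

runtime = e1                                                  # plays the role of Y
observed = [c * a + s * b for a, b in zip(e1, e2)]            # plays the role of a'
predicted = [(c * c - s * s) * a + (2 * c * s) * b            # plays the role of a
             for a, b in zip(e1, e2)]

print(round(corr(predicted, observed), 3))   # predicted vs observed component
print(round(corr(observed, runtime), 3))     # observed component vs runtime
print(round(corr(predicted, runtime), 3))    # predicted component vs runtime
```

Each link rotates the vector by the same angle in sample space, so the two 0.95 correlations
compound to cos(2θ) = 2(0.95)^2 - 1 = 0.805. The thesis case involves a three-term linear
combination rather than a single chain, so this is an analogy for the effect, not a reproduction
of it.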


Of course, the final cost model does not include the component cost models, only the terms
indicated by the component cost models. The ultimate failure lies with the composite cost models'
(CostRuntime and CostxRuntime) inability to predict the complex, multi-dimensional surface of the
runtime versus allocation relationship. While the underlying linear model may be an incorrect form
of parameter relationship to runtime, another likely insufficiency is the choice of parameters for
the model. Fundamental influences on the performance of VHDL simulations are not included in the
current models (e.g. event balance, lookahead, feedback). Furthermore, the simple heuristics
implied by a 3-6 term equation cannot accurately predict the pitted surface this and previous
research seeks to model. As demonstrated in Figure 21, Kapp's theoretic model also fails to
correctly order allocations for just the case of 6 LP simulations of the Wallace Tree multiplier.
Unpredictable results can be expected in application to other circuits as well. Note that the
models are not necessarily inaccurate; their usefulness is simply limited to an unacceptably small
and undefined portion of the search space.

Figure 18  CostSync versus tSync

Figure 19  CostBlock versus tBlock

Figure 20  Statistical Cost Models versus tMaxRun

Figure 21  6 LP Wallace Tree runtime versus Cost Models

4.3  Model Extrapolation (Assoc Mem)


Figure 22 demonstrates the mediocre performance of the cost model for another circuit. Here,
however, CostxRuntime suffers a handicap since the simple DFS partition has no induced feedback
(see next section). Ignorant of the influence of feedback, CostxRuntime creates the problem. The
no-feedback BFS allocation displays the destructive effect of pipeline initialization costs
associated with that technique (see next section). The true curiosity of this graph is the
uppermost curve, which is a simple DFS partition. Simple data partitioning of graphs was enhanced
in GPT v3.0. By keeping strongly connected components of the graph together during depth first
partitioning, GPT v3.0 coincidentally creates an allocation with no induced feedback. This
radically improves the simulation runtime as discussed in the next section.

4.4  Feedback


The effect of feedback on conservative simulations has been understood qualitatively, if not
quantitatively. Cycles in the problem graph cause participating LPs to proceed in lockstep fashion.
No member can proceed until the feedback of its last message propagates around the cycle. Kapp
identifies strongly connected components within the problem graph and avoids breaking them over
multiple LPs. Contracting the problem graph into strongly connected components results in an
acyclic directed graph. Another form of feedback is caused by the contraction of the problem graph
into the LP graph where it is possible to induce feedback on a feedforward network (Mannix,
1988:3-19). Figure 23 presents an example of induced feedback. Oddly enough, Simple Depth First
and Simple Breadth First allocations of strongly connected components both fail to guarantee
prevention of induced feedback. A general algorithm developed in this research uses the strongly
connected component contraction of the problem graph to further contract onto LPs without
inducing feedback.

No Feedback Breadth First Allocation

For each vertex v define d(v) where
    d(source) = 0 and
    d(v) = max(d(w_i)) + 1 over all vertices w_i to which v is adjacent, i.e. (w_i, v) is an edge.

Perform a Breadth First Search as
    Initialize
        color(v) <- white for all v
        Q <- {all sources}
        i <- 0

    while Q is not empty loop
        v <- head(Q)
        D <- d(v) + 1
        for each u in Adj(v) loop
            if color(u) is white and d(u) = D then
                color(u) <- gray
                Enqueue(Q, u)
            end if
        end loop
        Dequeue(Q)
        LP_i <- v
        if size(LP_i) > NumVertices/NumLPs then
            i <- (i + 1)
        end if
        color(v) <- black
    end loop


Figure  24  is  the  resulting  partition  of  applying  the  algorithm  to  the  graph  of  Figure  23. 
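
A minimal Python sketch of the allocation, together with a check for induced feedback, is given
below. Several things here are assumptions for illustration: the input is taken to be the acyclic
graph left after contracting strongly connected components, the function names are hypothetical,
and the six-vertex graph is a made-up example rather than the graph of Figure 23.

```python
from collections import deque


def depths(adj):
    """Longest-path depth d(v) from the sources of a DAG {vertex: [successors]}."""
    indeg = {v: 0 for v in adj}
    for v in adj:
        for u in adj[v]:
            indeg[u] += 1
    d = {v: 0 for v in adj}
    ready = deque(v for v in adj if indeg[v] == 0)
    while ready:
        v = ready.popleft()
        for u in adj[v]:
            d[u] = max(d[u], d[v] + 1)
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    return d


def no_feedback_bfs(adj, num_lps):
    """Fill LPs in breadth-first order, enqueuing a vertex only from its deepest predecessor."""
    d = depths(adj)
    has_pred = {u for v in adj for u in adj[v]}
    queue = deque(v for v in adj if v not in has_pred)      # all sources
    seen = set(queue)
    lps = [[] for _ in range(num_lps)]
    i = 0
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in seen and d[u] == d[v] + 1:
                seen.add(u)
                queue.append(u)
        lps[i].append(v)
        if len(lps[i]) > len(adj) / num_lps:                # same size test as the pseudocode
            i += 1
    return lps


def has_induced_feedback(adj, lp_of):
    """True if contracting vertices onto their LPs creates a cycle in the LP graph."""
    lp_adj = {}
    for v in adj:
        lp_adj.setdefault(lp_of[v], set())
        for u in adj[v]:
            if lp_of[u] != lp_of[v]:
                lp_adj[lp_of[v]].add(lp_of[u])
    state = {}                                              # 1 = on DFS stack, 2 = finished

    def visit(n):
        state[n] = 1
        for m in lp_adj.get(n, ()):
            if state.get(m) == 1 or (m not in state and visit(m)):
                return True
        state[n] = 2
        return False

    return any(n not in state and visit(n) for n in list(lp_adj))


# Made-up feedforward graph: 0 -> 1 -> 2 -> 5 and 0 -> 3 -> 4 -> 5.
graph = {0: [1, 3], 1: [2], 2: [5], 3: [4], 4: [5], 5: []}
allocation = no_feedback_bfs(graph, 2)
lp_of = {v: i for i, members in enumerate(allocation) for v in members}
naive = {0: 0, 1: 1, 2: 0, 3: 0, 4: 1, 5: 0}                # splits the chain 0 -> 1 -> 2

print(allocation, has_induced_feedback(graph, lp_of), has_induced_feedback(graph, naive))
```

Because vertices leave the queue level by level, the LP index never decreases along an edge, which
is why the contracted LP graph stays acyclic in this sketch. The naive assignment shows the
opposite case: placing vertices 0 and 2 on one LP and vertex 1 on another turns a feedforward
chain into an LP-level cycle.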

Applying this technique to the Wallace Tree Multiplier resulted in the speedup curve of Figure 25.

Superlinear speedup is achieved through a 7 LP partition! As can be seen in Figure 26, the
elimination of feedback radically changes the component influences on runtime. Blocking is now a
minority influence through 8 processors. The algorithm falls short in that it forms an allocation
that behaves like a pipelined system. There is an initialization cost to “fill” the pipeline,
resulting in the end processor having significantly more idle time than the first. This
initialization cost can be amortized to insignificance by increasing the number or size of
simulation test benches. However, the detrimental effect of “filling the pipe” can be seen in
comparing the DFS and BFS No-Feedback partitions of the Associative Memory circuit in Figure 22.


Figure 25  No Feedback Speedup of Wallace Tree (y-axis: Biased Speedup; legend: Breeden Random,
Kapp SBF Annealing, No Feedback)


While the identification and exploitation of feedback allows a new, pertinent parameter to enter a
cost model, a new cost model must be completely redeveloped due to the drastic changes in the
behavior of the circuit. Furthermore, currently there is only a nominal measure of feedback:
presence or absence. Feedback may have a graduated influence, which would require the definition
and measurement of a feedback metric.


5.  Conclusions  and  Recommendations 


5.1  Research  Summary 

As modern digital circuits grow larger and more complex, the time required to perform sequential
simulations becomes unacceptably long. Since simulation is a vital input to the design and
validation of circuits, this bottleneck affects the efficiency of the entire development cycle.
Parallel simulation offers a solution that scales with the problem. By assigning circuit
components to distributed processors, the work of the simulation can be divided. There is,
however, an additional cost of synchronization between the cooperative processors not present in
sequential simulation. The manner in which circuit components are partitioned among processors
greatly influences the amount of overhead incurred. The task is to partition intelligently such
that computational parallelism is not overwhelmed by synchronization overhead.

In this research effort heuristic techniques of intelligent partitioning were considered. By
observing trends of successful partitions, a statistical relationship of a priori parameters to
parallel simulation runtime was developed. Formal definition of this relationship in the form of a
cost model allowed allocations to be ordered by predicted runtime. By choosing the allocation with
the lowest cost model value, the simulation using that allocation was expected to have the lowest
runtime of the set considered. “The set considered” is an important distinction because the
mapping of tasks to processors to achieve the lowest runtime is a known NP-Complete problem.
Finding an optimal solution is intractable; finding relatively good solutions is desirable. The
set of candidate allocations is chosen by using Kapp’s AB Improvement iterative search procedure.
This algorithm moves vertices on partition borders to other LPs until moves fail to improve the
cost model. The best allocation is then used for simulation.

A 1,050 gate Wallace Tree Multiplier was used to develop and validate the statistical cost models
proposed. The 4,243 gate Associative Memory circuit was used to test the general applicability of
the cost models.
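
The border-move search summarized above can be sketched generically. The code below is not Kapp's
AB Improvement procedure; it is a simplified greedy variant written for illustration, and both the
function names and the toy cost (cut edges plus the largest LP load) are hypothetical stand-ins
for a real cost model such as CostRuntime.

```python
from collections import Counter


def toy_cost(edges, lp_of):
    """Hypothetical cost: number of inter-LP edges plus the largest LP load."""
    cut = sum(1 for u, v in edges if lp_of[u] != lp_of[v])
    return cut + max(Counter(lp_of.values()).values())


def improve(edges, lp_of, num_lps, cost):
    """Move border vertices between LPs until no single move lowers the cost."""
    lp_of = dict(lp_of)
    best = cost(edges, lp_of)
    improved = True
    while improved:
        improved = False
        # Border vertices are the endpoints of edges that cross LP boundaries.
        border = sorted({w for u, v in edges if lp_of[u] != lp_of[v] for w in (u, v)})
        for v in border:
            best_lp = lp_of[v]
            for lp in range(num_lps):
                if lp == best_lp:
                    continue
                lp_of[v] = lp
                c = cost(edges, lp_of)
                if c < best:
                    best, best_lp, improved = c, lp, True
            lp_of[v] = best_lp              # keep the best placement found for v
    return lp_of, best


# Made-up four-vertex chain with a deliberately poor starting allocation.
edges = [(0, 1), (1, 2), (2, 3)]
start = {0: 0, 1: 1, 2: 0, 3: 1}
final, final_cost = improve(edges, start, 2, toy_cost)
print(final, final_cost)
```

The search stops at the first allocation that no single border move improves, which is why the
quality of the result depends entirely on how well the cost function orders allocations.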

5.2  Conclusions 

The  following  conclusions  about  partitioning  VHDL  circuits  for  parallel  simulation  result 
from  this  research: 

•  Partial ordering of allocations by runtime is still not possible via proposed models. The
relationship that the proposed cost models attempted to emulate is that between runtime and
allocation. For any circuit, this surface can be imagined in three dimensions where each allocation
is a point in the xy plane and the corresponding observed runtime is plotted along the z-axis. If
total knowledge were available, a desired cost model would efficiently follow this surface using
a priori parameters. Of course, this target surface has many more dimensions than three. Adjacent
allocations differ by the placement of only one behavior. There are VerticesNum behaviors available
for placement in LPsNum - 1 other LPs, thus the surface has something less than
VerticesNum · (LPsNum - 1) dimensions. While some dimensions may be constant, it is easy to see
the inherent intractability of effectively modeling this complex relationship. As demonstrated
in this research, current models fail to partially order allocations under the most restrictive of
conditions. Further restrictions would make successful conclusions uninteresting.

• Good cost models are those that demonstrate stochastic reliability and efficiency, not correct
partial ordering. This paper does not suggest the abandonment of simulation cost models.
Most intelligent allocations exceed the performance of random partitioning. Instead of attempting
to order allocations, cost models that demonstrate good correlation to observed runtime
should be pursued. Scatter plots of observed runtime versus cost model results would assist in
determining the form of the model. Correlation and regression measurements would allow
quantitative assessment of the model. Computational effort should also be considered in
evaluating cost model merit.

• Statistical model development is an effective technique. The three component cost models
proved very accurate for predicting tProc, tSync, and tBlock for the validation set. Unfortunately,
these predicted parameters do not hold much value. Moreover, the original component cost
models will likely fail when applied to no-feedback allocations. But, under the restrictions of
feedback-littered simulation of a Wallace Tree Multiplier, the models were accurate. The key,
as with any model, is identifying all pertinent parameters and the scope of application. Given
this information, statistical model development was able to accommodate and offer insight into
the nature of inter-parameter relationships.

• Feedback is a major contributor to conservative simulation blocking. The concept of explicit
feedback was addressed by Kapp's use of strongly connected components. An efficient algorithm,
O(VerticesNum), is presented as part of this research to guarantee no feedback between
LPs. In the case of the Wallace Tree Multiplier, this improvement reduced blocking from an
exponential influence (as LPsNum increases) to a linear one. Yet, again, no single parameter is
the panacea of VHDL circuit partitioning. Feedback must be measured and included in any
effective cost model.

• Conservative Parallel VHDL Simulation can achieve super-linear speedup. Speedup curves
for both the Wallace Tree Multiplier and the Associative Memory demonstrated speedup up to 8
processors (the limit of this research). This verifies that VHDL simulation has some form of
complexity such that real work decreases as the number of LPs increases (see
Figure 16). This implies that scalability is feasible for coarse grain simulation using a
conservative protocol.
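
The second conclusion above recommends judging a cost model by its correlation with observed runtime rather than by exact ordering. A minimal sketch of such an assessment follows, assuming paired lists of cost model values and measured runtimes. The function names are hypothetical, the rank helper assumes no tied values, and the numbers in the usage section are placeholders, not results from this research.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Rank correlation: how well the cost model orders the allocations."""
    return pearson(ranks(xs), ranks(ys))


def fit_line(xs, ys):
    """Least-squares slope and intercept for runtime = slope * cost + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Placeholder values for illustration only.
    cost_values = [10.0, 12.0, 15.0, 20.0, 22.0]
    runtimes = [31.0, 30.0, 36.0, 41.0, 47.0]
    print("Pearson r:", pearson(cost_values, runtimes))
    print("Spearman rho:", spearman(cost_values, runtimes))
    print("slope, intercept:", fit_line(cost_values, runtimes))
```

A rank correlation of 1 would correspond to a perfect ordering of the allocations considered. Values below 1 quantify how far a model falls short of that goal while still being useful.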



5.3  Recommendations  for  Further  Research 

Given the strong conclusions of this research, several directions should be taken to continue the
maturation of conservative simulations at AFIT.

• Change the research goal from partial ordering to stochastic reliability. At this evolutionary
stage of sophistication for modeling parallel simulation performance, partial ordering is an
unrealistic goal. If achieved, the conditions would be so restrictive as to negate the usefulness of
the findings. The goal must be redirected to achieving a statistically reliable model in
accordance with the conclusion of the previous section. This change in direction would free the
researcher to open the scope of investigation since the weight of proof would be limited to
statistical evidence, not exhaustive trials of isolated cases. The real burden becomes collecting
adequate statistics to form the cost model and demonstrating significant correlation. Perhaps an
evolving database of statistics could realize guidelines for model choice based on circuit
parameters (e.g., test bench, feedback level, average fanout, etc.).

• Expand the parameters considered in model development. At a minimum, GPT v3.0
needs to maintain LP-specific statistics. Valuable insight into the relationship between
partitioning and runtime is being lost to summary statistics which hide the max function that drives
runtime. Many statistics are already gathered and just need to be included in the output.

• Enhance the search capabilities of GPT v3.0. Current iterative processing moves a single
vertex at a time. This is computational suicide. Evaluation of graph statistics is relatively
expensive in the current version of GPT. Moving a single vertex at a time amounts to mowing a
soccer field a blade at a time. More drastic steps are required to adequately explore the huge
search space. The first method to accomplish this would be to turn the current process into a
true annealing routine. It would be better to parallelize a genetic algorithm which can consider
several points at once per processor. However, all techniques will be limited by the current
expense of statistics evaluation. Since each iteration of the graph is some delta from the
previous cost model value, some technique of marginal evaluation would lower the computational
costs and permit more time for searching. Given that this benefit would be realized many
thousands of times per search, it should be given significant consideration. Unfortunately, the
more complicated the model, the more difficult marginal accounting would be.

• Continue to enhance the simulation environment. The first item to change must be the data
type of timestamps and clocks in VSIM and SPECTRUM. Overflow has already caused
problems for researchers (Kapp, 1993:125). The time data type should be changed to a
floating point value. Additionally, the use of tvust as the null message timestamp is extremely
limiting. Only when output arcs are the hosts of the next event is that the correct timestamp. If
some method of identifying the event host arc were accomplished, a more accurate timestamp
would yield the benefit of lookahead.

• Automate the simulation cycle. Batch files and utility programs allowed over 600 VHDL
simulations to be considered in this research. To support the quantity of results required to
provide statistical significance, the entire simulation cycle from partitioning through data
collection should be seamlessly automated.
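
The marginal evaluation suggested in the third recommendation above can be illustrated with the simplest graph statistic, the number of edges cut between LPs. The sketch below is an assumption-laden example rather than a part of GPT: the names edge_cut, build_adjacency, and cut_delta are hypothetical, and edge cut stands in for whatever statistics the chosen cost model needs. When a single vertex moves, only its own edges can change state, so the update costs time proportional to the degree of that vertex instead of the size of the graph.

```python
def edge_cut(edges, part):
    """Full evaluation: count edges whose endpoints sit on different LPs."""
    return sum(1 for u, v in edges if part[u] != part[v])


def build_adjacency(edges):
    """Neighbor lists (with multiplicity) for every vertex that has an edge."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    return adjacency


def cut_delta(adjacency, part, vertex, new_lp):
    """Change in edge cut if vertex moves to new_lp, touching only its own edges."""
    old_lp = part[vertex]
    delta = 0
    for neighbor in adjacency.get(vertex, []):
        if neighbor == vertex:
            continue  # a self-loop is never cut, before or after the move
        was_cut = part[neighbor] != old_lp
        now_cut = part[neighbor] != new_lp
        delta += int(now_cut) - int(was_cut)
    return delta


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
    part = {0: 0, 1: 0, 2: 1, 3: 1}
    adjacency = build_adjacency(edges)
    before = edge_cut(edges, part)
    delta = cut_delta(adjacency, part, vertex=2, new_lp=0)
    part[2] = 0
    # The incremental result matches a full re-evaluation.
    print(before + delta, edge_cut(edges, part))
```

As the recommendation itself cautions, a richer cost model makes this kind of bookkeeping harder, because each additional statistic needs its own update rule.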

Unfortunately, the objectives of this research were not met. The cost models developed in
this study demonstrated no statistical improvement over Kapp's theoretic model. Nor did the
models succeed in defining a partial order of allocations for even the limited case of a fixed number
of LPs for a single circuit upon which the model was derived. This report concludes that research
has not progressed as far as previously thought, in that previous findings are not statistically
significant. What was accomplished, however, was the mathematically rigorous investigation and
documentation of VHDL simulation behavior. By maturing the tools developed in preceding
research, additional instrumentation allowed quantitative exploration of the behavior that limits
runtime. It was this instrumentation that demonstrated the decreasing complexity of real work as more
processors are utilized in the simulation. Similarly, by eliminating feedback, the dominating effect
of conservative blocking was tamed. By introducing statistical methods into this research, elusive
theoretic proofs were avoided in favor of stochastic legitimacy.
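
To make the idea of eliminating feedback between LPs concrete, the sketch below uses a standard technique that is not the algorithm developed in this research. It assumes the circuit graph is acyclic (for example, after strongly connected components have been collapsed into single vertices), places the vertices in topological order, and cuts that order into contiguous blocks. Every edge then runs from an LP to the same or a higher-numbered LP, so no LP can wait on output that depends on its own messages. All function names are hypothetical.

```python
from collections import deque


def topological_order(num_vertices, edges):
    """Kahn's algorithm; raises ValueError if the graph contains a cycle."""
    indegree = [0] * num_vertices
    successors = [[] for _ in range(num_vertices)]
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = deque(v for v in range(num_vertices) if indegree[v] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != num_vertices:
        raise ValueError("cycle found; collapse strongly connected components first")
    return order


def no_feedback_partition(num_vertices, edges, num_lps):
    """Assign contiguous blocks of a topological order to LPs 0..num_lps-1."""
    order = topological_order(num_vertices, edges)
    block = -(-num_vertices // num_lps)  # ceiling division
    return {v: position // block for position, v in enumerate(order)}


def all_edges_forward(edges, part):
    """True if no edge points from a higher-numbered LP back to a lower one."""
    return all(part[u] <= part[v] for u, v in edges)


if __name__ == "__main__":
    dag = [(0, 2), (1, 2), (2, 3), (3, 4), (1, 4), (4, 5)]
    allocation = no_feedback_partition(6, dag, num_lps=3)
    print(allocation, all_edges_forward(dag, allocation))
```

This construction ignores load balance and communication volume entirely. It only shows that a feedback-free allocation can be built in time linear in the size of the graph, which is why feedback can be treated as one measured parameter among several rather than as the sole objective.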

In a word, this report succeeds in displaying our ignorance of simulation behavior. However,
by accurately assessing what is known, more productive steps can be taken into the unknown.
Statistical techniques are excellent ways to overcome the inability to describe complex systems.
Current research of parallel discrete event simulation remains in Thorndike's "organization and
prediction" phase of the scientific process. With the appropriate software and mathematical tools
in place, maturation to explanation and understanding is just a matter of time.